The data classification problem
Here’s a seemingly simple question: how would you find all of the names in a file? There are a lot of potential answers: use a classification model, use some sort of hard-coded search, look for words with name-like capitalization, and so on. But none of these are one-size-fits-all. A model that finds all of the names in a book may not succeed at finding all of the names in an HTML page. A hard-coded search doesn’t work if your data comes from a different country with different names. Capitalization isn’t guaranteed if the file is a list of names from a survey, normalized to all lower case. No single approach to private information detection works for every kind of data you could encounter. Even if you have an approach you consider decent, it probably works significantly better on some data types than others.
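To make the capitalization pitfall concrete, here is a minimal sketch of that heuristic; the regex and function name are purely illustrative, not how any real detector works:

```python
import re

def capitalized_name_candidates(text):
    """Naive heuristic: flag runs of consecutive capitalized words as possible names."""
    return re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", text)

# Works on prose with conventional capitalization...
print(capitalized_name_candidates("We interviewed John Smith yesterday."))  # ['John Smith']

# ...but misses names in a lower-cased survey export entirely.
print(capitalized_name_candidates("respondent: john smith"))  # []
```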
How we solve it differently
As an engineer on the Granica Screen team, I’ve learned to be very observant about data. Every file contains useful patterns, even “unstructured” data; otherwise, it would just be gibberish. For example, plain text can be natural language, a bulleted list, or a completely ungrammatical word dump. Instead of ignoring these patterns and approaching all data with the same strategy, Screen sets itself apart by recognizing what it’s looking at and adjusting accordingly.
Some data is written by a human to be understood by humans, not machines. Take this blog post. How would you build something that identifies who wrote it? The post doesn’t structurally mark private information as private; finding it requires understanding the meaning of the words themselves. This is a classic natural language processing (NLP) detection problem. Many simple solutions leverage NLP approaches to identify private information, and Screen does this well. Like our competitors, we can parse private information from natural language text and other data designed for human consumption.
However, not all data is human-focused. Sometimes, the data is made for computers. It’s hard for a tool focused on natural language to detect a name in a sparse file that doesn’t contain any natural language at all. ML models built for text classification rely primarily on grammar learned from datasets such as news articles, websites, and books. These models fall short when they encounter data that doesn’t share the patterns of that training data. Websites, JSON files, bulleted lists, tabular data, and other semi-structured data don’t look anything like a standard essay or news article, so any model trained on those sources will struggle to identify names in such data. It’s much easier if your product is smart enough to say, “oh hey, this file is JSON that says `"name": "John Smith"`, so John Smith must be a name.” We can use our understanding of how the data is formatted to infer what each token represents.
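As a rough illustration of that key-based inference, the sketch below walks a parsed JSON tree and flags string values stored under name-like keys. The `NAME_KEYS` set and function names are hypothetical assumptions, not Screen’s actual logic:

```python
import json

# Hypothetical key names that commonly signal a person's name.
NAME_KEYS = {"name", "full_name", "first_name", "last_name", "author"}

def find_names_by_key(node, path=()):
    """Walk a parsed JSON tree; yield (path, value) for name-like keys."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key.lower() in NAME_KEYS and isinstance(value, str):
                yield path + (key,), value
            else:
                yield from find_names_by_key(value, path + (key,))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_names_by_key(item, path + (i,))

doc = json.loads('{"user": {"name": "John Smith", "clicks": 5}}')
print(list(find_names_by_key(doc)))  # [(('user', 'name'), 'John Smith')]
```

Note that no language model is involved: the structure alone tells us what the token represents.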
Now that we have a basic understanding of the differences between approaches used in unstructured data vs. semi-structured or structured data, we can see where the lines between those categories of data can sometimes blur. For example, take a news website. Most of the text on the website is a news article. However, the article itself isn’t presented in one nice natural language block. It’s split up by HTML tags that determine how the page displays online. A model trained on news articles would be able to see “We interviewed John Smith about Granica.” and pick up “John Smith” as a name. That same model might struggle a lot more if it saw:
We interviewed <a href="link to an article about John" alt="John Smith article">John Smith</a> about Granica.
If we try to treat this as a news article, the tags get in the way. If we detect and remove the tags, the alt text in the tag containing John’s name is discarded and that name is missed. As a result, the best option is actually a two-pronged approach: treat it both as HTML that can be parsed structurally and as natural language text that can be handled by a language model.
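A minimal sketch of that two-pronged idea, using Python’s standard `html.parser` to collect both visible text and human-readable attribute values, with a stand-in regex where a real NER model would go (all class and function names here are illustrative, not Screen’s implementation):

```python
from html.parser import HTMLParser
import re

class TextAndAttrExtractor(HTMLParser):
    """Collect both visible text and human-readable attribute values."""
    TEXT_ATTRS = {"alt", "title"}  # attributes likely to contain prose

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        if data.strip():
            self.segments.append(data.strip())

    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            if attr in self.TEXT_ATTRS and value:
                self.segments.append(value)

def detect_names(text):
    # Stand-in for a real NER model: capitalized word pairs.
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

snippet = ('We interviewed <a href="link to an article about John" '
           'alt="John Smith article">John Smith</a> about Granica.')
parser = TextAndAttrExtractor()
parser.feed(snippet)
found = {name for seg in parser.segments for name in detect_names(seg)}
print(found)  # {'John Smith'}
```

Because the parser keeps the `alt` text as its own segment, the name is found even if the copy between the tags were replaced by, say, “him.”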
Another case where we have to treat semi-structured data in a nuanced way is with data that wraps other data. If we look at a JSON file containing the number of times each link was clicked in the website described above, we could see something like:
{
  "text": "<p>We interviewed <a href=\"link to an article about John\" alt=\"John Smith article\">John Smith</a> about Granica.</p>",
  "link": "John Smith",
  "url": "link to an article about John",
  "num clicks": 5
}
Here, not only do we need to detect John’s name in its original context, we also need to recognize it in a context where it is not name-like at all. If we only parse the JSON as key-value pairs, “John Smith” is labeled as a link. A link can’t be a name, can it? But in this case, it is, and the key lies in the full meaning of the document. “John Smith” does refer to a link, but that link provides context about a person named John Smith. As with the problem above, a naive look at the closest context doesn’t give us the correct read on where the name is, but just throwing a natural language model at the whole thing wouldn’t yield good results either. Instead, we need to break down the separate layers of this data in order to determine the meaning of each individual piece. Only then can we accurately determine the location of the name.
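One way to sketch that layered breakdown: parse the JSON first, then parse any HTML embedded in its string values. The helper names and the crude tag-detection shortcut are illustrative assumptions, not Screen’s implementation:

```python
import json, re
from html.parser import HTMLParser

class ProseCollector(HTMLParser):
    """Pull visible text and alt attributes out of embedded HTML."""
    def __init__(self):
        super().__init__()
        self.parts = []
    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())
    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            if attr == "alt" and value:
                self.parts.append(value)

def detect_names(text):
    # Stand-in for a real NER model: capitalized word pairs.
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)

def names_in_record(record):
    """Layer 1: JSON structure. Layer 2: HTML inside string values."""
    names = set()
    for key, value in record.items():
        if not isinstance(value, str):
            continue
        if "<" in value and ">" in value:  # crude check for embedded HTML
            collector = ProseCollector()
            collector.feed(value)
            for part in collector.parts:
                names.update(detect_names(part))
        # A bare value under a non-prose key like "link" is not name-like
        # on its own; it is the HTML layer that recovers John Smith here.
    return names

record = json.loads("""{
  "text": "<p>We interviewed <a href=\\"link to an article about John\\" alt=\\"John Smith article\\">John Smith</a> about Granica.</p>",
  "link": "John Smith",
  "url": "link to an article about John",
  "num clicks": 5
}""")
print(names_in_record(record))  # {'John Smith'}
```

Each layer is handled by the parser suited to it, which is the essence of the nuanced view described above.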
Our ability to detect patterns in data, use those patterns to gain insight into what each component represents, and use that information to detect the precise location of PII is one of Screen’s biggest strengths. This not only improves precision and accuracy, but speed as well. By preprocessing our data using the patterns of structured data to our advantage, we can reduce the amount of data put through resource-intensive processes like ML models and complicated searches. This nuanced view of data and the ability to truly break it down into meaningful pieces is what sets Screen apart.
We’ve seen that it works
Currently, Granica Screen supports a large variety of file types, and we are actively working on expanding our capabilities. Across these file types, we are able to demonstrate superior performance compared to our larger competitors. Through our unique approach, we provide highly accurate results in a faster, more cost-efficient, and more scalable manner. We can detect PII in the strangest of places, such as in comments within HTML tags or deeply nested JSON. I’m amazed that we’ve accomplished this as a small team compared to industry giants. We’ve managed to innovate and create smarter solutions for data privacy through Granica Screen, and I’m very excited to share this technology with more customers and improve data privacy and security everywhere we go.
Interested to learn more? Book a Demo with our experts today.