Facebook, Google, and Twitter lawyers testified to Congress about how they missed the Russian influence campaign. Even though the ads were bought in Russian currency on platforms chock-full of analytics engines, the problematic nature of the influence campaign went undetected. “Rubles + US politics” did not trigger an alert, because off-the-shelf deep learning only looks for what it knows to look for and, on a deeper level, it learns from really messy (unstructured) or corrupted and biased data. Our ability to understand the unstructured nature of public data (mixed with private data) is improving by leaps and bounds every day. That’s one of the main things I work on. Let’s focus instead on the data quality problem.
Here are a few of the many common data quality problems:
- Data sparsity: We have a bit of the picture for a lot of things, but no clear picture for most of them.
- Data corruption: Convert a PDF to text and print it. Yeah. Lots of garbage comes out besides the text.
- Lots of irrelevant data: In a chess game, we can prune whole sections of the tree search, and more generally, in a picture of a cat, most of the pixels don’t tell us how cute the cat is. In totally random data, we humans (and AI) can see patterns where there really are none.
- Learning from bad labeling: Bias of the labeling system, possibly due to human bias.
- Missing unexpected patterns: Black swans, regime change, class imbalance, etc.
- Learning wrong patterns: Correlation that is not really causation can be trained into an AI, which then assumes wrongly that the correlation is causative.
- I could go on.
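To make the point about seeing patterns in totally random data concrete, here is a minimal sketch. It assumes nothing beyond the Python standard library; the function names, the sample sizes, and the seed are all made up for illustration. It generates a random target and many random features, then reports the strongest correlation it stumbles on.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_rows=20, n_features=200, seed=0):
    """Return the largest |correlation| between pure noise and a noise target.

    Every column is independent random noise, so any strong correlation
    found here is a coincidence, not a real pattern.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(feature, target)))
    return best


if __name__ == "__main__":
    print(f"Strongest correlation found in pure noise: "
          f"{strongest_spurious_correlation():.2f}")
```

With only a few rows and a lot of candidate features, something will almost always look correlated. A model that is allowed to pick the best-looking feature will happily learn that coincidence.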
We know that labelled data is really hard to come by for basically any problem, and even labelled data can be full of bias. I visited a prospective client on Friday that had a great data team but no ability to collect the data they needed from the real world because of ownership and IP issues.

This “Rubles + US politics” example of good data that is missed by AI is not surprising to experts. Why? Well, AI needs to know what to look for, and the social media giants were looking for more aggressive types of attacks, like monitoring soldiers’ movements based on their Facebook profiles. Indeed, the reason we miss signals from good data in general is the huge amount of BAD data in real systems like Twitter. This is a signal-to-noise ratio problem. If there are too many alerts, the alert system is ignored. Too few, and the system misses critical alerts.

It is not only adversaries like the Russians trying to gain influence. The good guys, companies and brands, do the same thing. Drip campaigns and guerrilla marketing are just as much a tactic for spreading influence in shoe sales as in political meddling in an election. So, the real reason we miss signals from good data is bad data. Simple predicate logic tells us that false assumptions can imply anything. So learning from data we know is error-riddled carries some real baggage.
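The alert-volume trade-off described above can be sketched in a few lines. This is only an illustration: the scores, the labels, and the `alert_stats` helper are hypothetical, and a real system would get its scores from a model rather than a hand-written list.

```python
def alert_stats(events, threshold):
    """Count alerts raised and threats missed at a given score threshold.

    events: list of (score, is_real_threat) pairs, all hypothetical here.
    """
    raised = [(score, real) for score, real in events if score >= threshold]
    true_alerts = sum(1 for _, real in raised if real)
    missed = sum(1 for score, real in events if real and score < threshold)
    return {
        "raised": len(raised),
        "true": true_alerts,
        "false": len(raised) - true_alerts,
        "missed": missed,
    }


# Hypothetical scored events: three real threats buried among six harmless ones.
EVENTS = [
    (0.95, True), (0.90, False), (0.70, True),
    (0.60, False), (0.55, False), (0.40, True),
    (0.30, False), (0.20, False), (0.10, False),
]

if __name__ == "__main__":
    for threshold in (0.25, 0.80):
        print(threshold, alert_stats(EVENTS, threshold))
```

At the low threshold every real threat is caught, but more than half of the alerts are noise. At the high threshold the alert stream is quiet, and two of the three real threats slip through.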
One example of bad data is finding that your AI model was trained on the wrong type of data. The text from a chat conversation is not like text from a newspaper. Both are composed of text, but their content is very different. AI trained on the Wikipedia dataset or Google News articles will not correctly understand (i.e. “model”) the free-form text we humans use to communicate in chat applications. Here is a slightly better dataset for that, and maybe the comments from the Hacker News dataset too. Often we need to use the right pre-trained model or off-the-shelf dataset for the right problem, and then do some transfer learning to improve on the baseline. However, this assumes we can use the data at all. Many public datasets have even bigger bad data problems that cause the model to simply fail. Sometimes a field is used and sometimes it is left blank (sparsity). Sometimes non-numeric data creeps into numerical columns (“one” vs 1). I found an outlier in a large private real estate dataset where one entry among a million was a huge number entered by a human as a fat-finger error.
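As a rough sketch of how the column problems above might be caught before training, the snippet below parses a messy numeric column and flags fat-finger outliers. Everything in it is an assumption made for illustration: the `WORD_NUMBERS` table is deliberately tiny, the cutoff of 10 is arbitrary, and the median-absolute-deviation rule is one common choice rather than the method used on the real estate dataset mentioned above.

```python
import math
import statistics

# Hypothetical, deliberately tiny lookup for numbers typed as words.
WORD_NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3}


def parse_number(raw):
    """Return a float, or None when the cell is blank or cannot be parsed."""
    if raw is None:
        return None
    text = str(raw).strip().lower().replace(",", "")
    if not text:
        return None  # blank cell (sparsity)
    if text in WORD_NUMBERS:
        return float(WORD_NUMBERS[text])  # "one" instead of 1
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. "n/a"
    return value if math.isfinite(value) else None


def flag_outliers(values, cutoff=10.0):
    """Flag values further than cutoff * MAD from the median.

    MAD is the median absolute deviation, which a single huge
    fat-finger entry barely moves.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []
    return [v for v in values if abs(v - median) / mad > cutoff]


if __name__ == "__main__":
    raw_prices = ["250000", "310,000", "", "n/a", "275000",
                  "290000", "265000", "2950000000"]
    prices = [p for p in map(parse_number, raw_prices) if p is not None]
    print("parsed:", prices)
    print("suspicious:", flag_outliers(prices))
```

The median-based rule is used here because a mean and standard deviation would themselves be dragged around by the very outlier we are trying to find.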
Problems like the game of Go (AlphaGo Zero) have no bad data to analyze. Instead, the AI evaluates more relevant and less relevant data. Games are a nice constrained problem set, but in most real-world data, there is bias. Lots of it. Boosting and other techniques can be helpful too. The truth is that some aspects of machine learning are still open problems, and shocking improvements happen all the time. Example: Capsule network beats CNN.
It is important to know when error is caused by bad things in the data rather than by improperly fitting to the data. And live systems that learn while they operate, like humans do, are particularly susceptible to learning wrong information from bad data. This is kind of like Simpson’s paradox, in that the data is usually right, and so fitting the data is a good thing, but sometimes fitting to the data produces paradoxes because the method itself (fitting to the data) is based on a bad assumption that all data is ground truth data. See this video for more on Simpson’s paradox fun. And here is another link to Autodesk’s Datasaurus, which I just love. It is totally worth reading in full.
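For readers who have not met Simpson’s paradox before, here is a small worked example. It uses the widely cited kidney stone treatment counts that are often used to illustrate the paradox; the `compare` helper and the layout of the table are my own assumptions for this sketch.

```python
# (successes, total) per treatment and stone size, from the commonly
# cited kidney stone illustration of Simpson's paradox.
COUNTS = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def overall_rate(groups):
    """Success rate after pooling all subgroups together."""
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return successes / total


def compare(counts):
    """Return the winner inside each subgroup and the winner on pooled data."""
    per_group_winner = {}
    for group in counts["A"]:
        a_success, a_total = counts["A"][group]
        b_success, b_total = counts["B"][group]
        a_rate = a_success / a_total
        b_rate = b_success / b_total
        per_group_winner[group] = "A" if a_rate > b_rate else "B"
    a_overall = overall_rate(counts["A"])
    b_overall = overall_rate(counts["B"])
    return per_group_winner, ("A" if a_overall > b_overall else "B")


if __name__ == "__main__":
    per_group, overall = compare(COUNTS)
    print("winner within each subgroup:", per_group)
    print("winner on pooled data:", overall)
```

Treatment A wins inside both subgroups, yet treatment B wins once the rows are pooled, because A was given mostly the harder cases. A model fit naively to the pooled table would learn the wrong lesson from perfectly accurate numbers.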
We talked about the fact that most real-world data is full of corruption and bias. That kind of sucks, but not all is lost. There are a variety of techniques for combating bad data quality, not the least of which are collecting more data and cleaning up the data. More advanced techniques like ensembles with NLP, knowledge graphs, and commercial-grade analytics are not easy to get your hands on. More on this in future articles.
If you enjoyed this article on bad data and artificial intelligence, then please try out the clap tool. Tap that. Follow us on Medium. Share on Facebook and Twitter. Go for it. I’m also happy to hear your feedback in the comments. What do you think?
-Daniel email@example.com ← Say hi. Lemay.ai 1(855)LEMAY-AI