February 10, 2022
We are not good at thinking about uncertainty in data analysis, but in 2022 we need to be
When people talk about data analysis, they often have big data, quantitative finance, and business analytics in mind. We use a broader notion of what data analysis is. We’d like to push the idea that it’s any time you use data to answer questions and to guide decision-making. That includes a lot of science, which is often about answering questions; a lot of engineering, where you design a system to achieve a particular goal; and of course decision-making, whether at the level of an individual, a company, or national public policy. Uncertainty in data analysis is involved in all of these.
Part of data science is making predictions, which we’ll come back to. The fact that we live in an uncertain world is incredibly interesting, because as a culture and a society we use probability to think about uncertainty. So we wonder whether we as humans are really good at thinking probabilistically. We do seem to have some instinct for it, even as young children: something akin to a Bayesian update. (Bayesian methods assign probabilities or distributions to events, like rain tomorrow, or to parameters, like a population mean, based on experience or best estimates before any experimentation and data collection, and then apply Bayes’ theorem.)
When we receive new data or new evidence, we update our beliefs, and in some cases we do a pretty good approximation of an accurate Bayesian update, usually for things in the middle probability range, maybe from about 25% to 75%. At the same time, we suck at very rare things. We’re pretty bad at small probabilities, and there are a number of ways we can be systematically fooled because we don’t do the math. We make approximations, and those approximations fail consistently in ways that behavioral psychologists have pointed out, such as confirmation bias and other cognitive failures like it. With all this in mind, we talk about uncertainty in data analysis and how we can be both good and bad at thinking probabilistically. As for the role of uncertainty in decision-making in general: we make decisions based on all the data we have, but most of the time the quality of a decision is judged by the quality of its outcome, which is not necessarily the right way to think about these things.
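To make the idea of a Bayesian update concrete, here is a minimal sketch in Python. The function name and all the numbers are assumptions chosen for illustration, not figures from the discussion. It applies Bayes’ theorem to the same piece of evidence twice: once with a mid-range prior, where the result matches intuition, and once with a very rare prior, where intuition tends to go wrong.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(H | E) given P(H), P(E | H), and P(E | not H)."""
    numerator = prior * p_evidence_if_true
    total_evidence = numerator + (1 - prior) * p_evidence_if_false
    return numerator / total_evidence


# Illustrative assumption: the evidence shows up 90% of the time when the
# hypothesis is true and 10% of the time when it is false.
mid_range = bayes_update(prior=0.5, p_evidence_if_true=0.9, p_evidence_if_false=0.1)

# Same evidence, but the hypothesis starts out as a 1-in-10,000 event.
rare = bayes_update(prior=0.0001, p_evidence_if_true=0.9, p_evidence_if_false=0.1)

print(f"posterior with a 50% prior:   {mid_range:.3f}")
print(f"posterior with a 0.01% prior: {rare:.5f}")
```

With a 50% prior the posterior rises to 90%, which is roughly what most people would guess. With the rare prior the posterior stays below 0.1%, even though the evidence is just as strong, and that is the kind of case where our mental approximation tends to fail.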
Where is uncertainty in data analysis prevalent in society?
We can see uncertainty in data analysis in election forecasts and in issues related to health and safety. These are all cases where there are risks, and interventions that have certain probabilities of good results and certain probabilities of side effects, where sometimes our heuristics are good and other times we make really consistent cognitive errors. There are a lot of cognitive biases, and one that we constantly fall prey to (I don’t even know what it’s called) is generalizing from a small sample: we see something happen several times and conclude that’s probably how things work.
Doctors have had a version of this in the past, where they would often make treatment decisions based on their own patients: “this or that drug didn’t work well for my patients; I’ve seen poor results with them.” The alternative is to rely on large randomized trials, and we now have a lot of evidence that randomized trials are a more reliable form of evidence than generalizing from small numbers. So what are some of the ways we can fight this when we get health and safety wrong? Certainly, one of the problems is that we are very bad at small risks, small probabilities.
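To put rough numbers on the contrast between a handful of patients and a large trial, here is a small sketch using the binomial distribution. Everything in it is a hypothetical assumption: a drug that truly helps 60% of patients, a doctor who has seen 10 patients, and a trial with 500. It computes the chance of observing a success rate of 40% or lower, which would make the drug look like it “didn’t work well”.

```python
from math import comb


def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


TRUE_SUCCESS_RATE = 0.6  # hypothetical: the drug really helps 60% of patients

# One doctor's experience: 10 patients, at most 4 of them improve.
small_sample = prob_at_most(4, 10, TRUE_SUCCESS_RATE)

# A larger trial: 500 patients, at most 200 of them improve (also 40%).
large_trial = prob_at_most(200, 500, TRUE_SUCCESS_RATE)

print(f"10 patients:  {small_sample:.3f}")
print(f"500 patients: {large_trial:.2e}")
```

Under these assumptions, about one doctor in six would see poor results purely by chance, while the same misleading result in the 500-patient trial is vanishingly unlikely.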
There is evidence that we do a little better if we express things in terms of natural frequencies. If we tell you that something has a probability of 0.01%, you might have a really hard time understanding that, but if we tell you it’s one in 10,000 people, you might have a way of imagining it. You might say, “Well, okay. At a baseball game there could be 30,000 people, so there could be three people here right now who have this or that condition.” So expressing things in terms of natural frequencies might be helpful. Essentially, these are techniques of language: adopting the phrasings that we know work. We believe graphical visualizations are also important. We have this incredibly powerful tool, our vision system, which can take in an enormous amount of data and process it quickly, so visualization is, in our view, one of the best ways to get information off a page and into someone’s brain.
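The natural-frequency arithmetic above is simple enough to sketch in a few lines of Python. The helper names are made up for this illustration; the numbers are the ones from the baseball example.

```python
def as_natural_frequency(probability):
    """Rephrase a small probability as 'about 1 in N'."""
    n = round(1 / probability)
    return f"about 1 in {n:,}"


def expected_count(probability, crowd_size):
    """Expected number of affected people in a crowd of the given size."""
    return probability * crowd_size


p = 0.0001  # 0.01% written as a fraction

print(as_natural_frequency(p))
print(f"expected in a crowd of 30,000: {expected_count(p, 30_000):.0f}")
```

The first line restates 0.01% as one person in 10,000, and the second recovers the roughly three people in the stadium.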
Misconceptions and uncertainty in data analysis
What are the biggest misconceptions about uncertainty that we, as data-driven educators, need to correct? We have talked about probabilistic predictions, and we think those are a big deal. But when you summarize a distribution, say by reporting only the mean, people usually assume it’s something like a bell-shaped curve, and we have some hunch of what it looks like. If we tell you that the average human being is about 165 centimeters tall (or I think it’s taller than that), you get an idea: probably some people are over 200, and probably some people are under 60, but there’s probably no one who is a mile tall. We have an idea of this distribution.
The techniques best suited to communicating uncertainty in data analysis
Beyond Bayesian inference, there are a few visualizations that people use all the time, and the most classic is the histogram. It is also the most appropriate for a general audience, because most people understand histograms. Violin plots are somewhat similar: essentially two back-to-back histograms. I think those are good because people understand them. We have seen people point out many times that you have to get the histogram right. If the bin size is too big, you smooth away a lot of information that might interest you. If the bin size is too small, you get a lot of noise, and it can be difficult to see the shape of the distribution through it. So one thing we can recommend is to use CDFs, rather than histograms or PDFs, as the default visualization. When we drill down into a dataset, we almost always look at CDFs, because they give the best view of the shape of the distribution: you can see the modes, the central tendency, and the spread. If you have weird outliers, they jump out, and if you have repeated values, you can see them clearly in a CDF, with less visual noise distracting you from the important stuff. The only problem is that people don’t understand them. But we think that’s another case where the public is being educated: the more people consume data journalism, the more they see visualizations like this, and there’s an implicit learning that happens.
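As a rough illustration of the histogram-versus-CDF point, the sketch below computes both for a tiny made-up sample using only the Python standard library. The data, the function names, and the bin widths are all assumptions for the example; a real analysis would more likely plot the results with a charting library.

```python
from collections import Counter


def empirical_cdf(values):
    """Return the sorted unique values and the fraction of data <= each one."""
    n = len(values)
    counts = Counter(values)
    xs = sorted(counts)
    cumulative = 0
    ps = []
    for x in xs:
        cumulative += counts[x]
        ps.append(cumulative / n)
    return xs, ps


def histogram(values, bin_width):
    """Count the values in each bin; bins are keyed by their left edge."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Made-up sample with a repeated value (160) and an outlier (230).
data = [150, 155, 160, 160, 160, 162, 165, 168, 170, 175, 180, 230]

xs, ps = empirical_cdf(data)
for x, p in zip(xs, ps):
    print(f"{x:>4}  {p:.2f}")

print(histogram(data, 50))  # wide bins: almost everything lands in one bin
print(histogram(data, 1))   # narrow bins: nearly every value gets its own bin
```

In the CDF, the repeated value shows up as a jump of 0.25 at 160, and the outlier appears as a long flat stretch before the final step at 230. The histogram hides both with bins of width 50, and with bins of width 1 it mostly reproduces the raw data.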
Call to action for uncertainty in data analysis
We believe that if you have not yet had the opportunity to study data science, you should. There are a lot of great resources available now that didn’t exist not so long ago. And especially if you took a statistics course in high school or college and it didn’t click for you, the problem isn’t necessarily you. The standard statistics curriculum has, for a long time, not been good for most people. I think it spends way too much time on esoteric hypothesis tests, and it gets bogged down in a statistical philosophy that isn’t a very good philosophy of science.
If you come back to it now from a data science perspective, you’re much more likely to find courses and educational resources that are much more relevant. They will be data-driven. They will be much more convincing. So give it another chance. And for people who have data science skills, there are many ways they can be used to do social good in the world. I think a lot of data scientists end up doing quantitative finance and business analytics, those are the two big areas of application. And there’s nothing wrong with that, but I also think there’s a lot of ways to use the skills you have to do something good, find stories about what’s going on and spread those stories. Use these stories as a means to effect change. Or if nothing else, just to answer questions about the world. If there is something that interests you, very often you can find data and answer questions.
In summary, we have talked about uncertainty in data science: how we think about prediction, how we can think probabilistically, how we do it right, and how we can also go wrong. We have also discussed the idea that data science is not necessarily reserved for data scientists, and that it could in fact be of interest to everyone.