Getting an Indecent Education: The Joys of Naked Statistics

By: Yitzchak Fried | Features | March 28, 2016

Often enough, statistics can seem like science’s mystifying tool for disproving common sense. Both the natural and social sciences employ statistical methods, and their findings make headlines and form the basis of the latest scientific and popular orthodoxies. That’s why there was such a commotion when, last year, the Reproducibility Project found that 60 out of 100 studies published in leading psychology journals yielded different results the second time around. This bombshell was deemed a “replication crisis”. Debate now rages between top psychologists over the field’s methodological soundness and the direction of its future. To a great extent, the discussion comes down to the power and limitations of statistical inference. As Yale psychologist Paul Bloom wrote in his article, “Psychology’s Replication Crisis Has a Silver Lining[1]”: “[M]uch of psychologists’ standard operating procedure—our style of collecting data, analyzing our results, reporting our findings, and deciding what to submit to publication—is biased toward “false positives,” where random effects are reported as significant findings. Too many of us engage in “p-hacking,”…where we rummage through our data looking for statistically significant findings, and then, with the most innocent of intentions, convince ourselves that these findings are precisely what we predicted in the first place.”

For the conscientious consumer of modern science, it’s natural to have some curiosity about what all the fuss is over.

Charles Wheelan’s Naked Statistics is the perfect book to satisfy that curiosity. As the subtitle suggests (“Stripping the Dread from the Data”), the book is an easy-to-read guide to the foundations and limitations of statistical methods. Wheelan cuts out as much of the math as possible, instead using long-winded (but very entertaining) stories to drive home key statistical concepts. This method does have its drawbacks. Wheelan occasionally compromises accuracy in the interest of his beautifully clear prose. (He claims erroneously on page 79 that adding “each outcome multiplied by its expected frequency” will always give you 1. It won’t; it is the probabilities alone that add up to 1. The weighted sum gives you the expected value, but more on this soon.) If he occasionally misspeaks, however, he deserves to be forgiven. The prose is good.

For a student who’s taken a course in statistics or probability theory, the book’s more technical math will be a review. Wheelan gives a brush-up on how to calculate variance, and how that relates to standard deviation. He also devotes a chapter to an intuitive account of correlation, without going into the mathematical details. Those who aren’t math geeks may be tempted to skip these parts – you can, without losing much of the big picture. The “Basic Probability” section is important though, because it explains how to calculate statistical averages, or “expected values”. (You may have encountered this term if you’ve taken statistics, probability, or an intermediate economics course.) Expected values are important – they tell us the average outcome we can expect over a long period of time. Take rainfall, which is usually reported as a yearly average. While the rainfall on any given day may be pretty unknowable, the average rainfall in an area is a good measure of whether it’s the right place to grow your crops.
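For readers who like to see the arithmetic, here is a minimal sketch of an expected-value calculation. The fair six-sided die and the function name `expected_value` are my own illustration, not an example taken from the book. The sketch also shows the distinction that matters for the page 79 slip: the probabilities add up to 1, while the outcomes weighted by those probabilities add up to the expected value.

```python
from fractions import Fraction


def expected_value(outcomes):
    """Return the sum of each outcome multiplied by its probability.

    `outcomes` is a list of (value, probability) pairs.
    """
    total_probability = sum(p for _, p in outcomes)
    if total_probability != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)


# Hypothetical example: a fair die, each face with probability 1/6.
die = [(face, Fraction(1, 6)) for face in range(1, 7)]

print(sum(p for _, p in die))  # the probabilities sum to 1
print(expected_value(die))     # the weighted outcomes sum to 7/2
```

No single roll ever comes up 3.5, which is the point: the expected value describes the long-run average, not any one outcome.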

The really interesting math comes in when Wheelan explains statistics’ theoretical underpinnings. When a study says that 70% of males between 18 and 36 tend to brush their teeth at least once a day, how did statisticians come up with that number? Generally, statisticians studying a population don’t actually evaluate every member of the population; they judge a sample group, and extrapolate the results of the group to the larger population. But how is that valid? As Wheelan explains, it comes down to the “normal distribution” and something called the Central Limit Theorem. Some useful background will help us here. The normal distribution is another name for the famous bell-shaped curve; it’s probably the most famous of what are called “probability density functions”. A probability density function (PDF) is a curve that shows how likely each part of a “continuous” range of outcomes is. Let me explain what I mean by “continuous”. Take, for instance, the height of male orthodox Jews in Yeshiva College. Something like height doesn’t come in discrete intervals; someone can be five foot five, five foot five and a half, five foot five and a half and a hundredth, or any conceivable fraction between five foot five and a half and five foot six. Assigning a probability to each possible outcome is impossible. For example, consider the probability of finding someone who is exactly five foot five and three thousandths of an inch. The odds are essentially zero. Instead, a probability density function describes the probability of finding heights within a certain interval – say, between five foot five and six foot, including all the infinite possibilities contained in between.

The “bell-shaped” normal distribution is useful because it neatly conveys the likelihood of given intervals. If something is described by a normal distribution, then about 95% of the time it falls within two standard deviations of the average outcome. As long as the standard deviation is not that large, this can give us a pretty narrow window of what values we can expect. For example, if the average height of Jewish males in Yeshiva College is five foot five, with a standard deviation of two inches, then about 95% of Jewish males can be expected to be between five foot one and five foot nine.
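The two-standard-deviation rule can be checked with a few lines of Python. This is only a sketch: it assumes heights really are normally distributed, uses the illustrative figures above (mean 65 inches, standard deviation 2 inches), and relies on `NormalDist` from Python’s standard `statistics` module.

```python
from statistics import NormalDist

# Illustrative figures from the example: mean 5'5" (65 in), SD 2 in.
heights = NormalDist(mu=65, sigma=2)

# Two standard deviations either side of the mean: 5'1" to 5'9".
low = heights.mean - 2 * heights.stdev
high = heights.mean + 2 * heights.stdev

# Probability of an interval = difference of the cumulative probabilities.
share = heights.cdf(high) - heights.cdf(low)
print(f"{low:.0f} to {high:.0f} inches: {share:.4f}")  # 61 to 69 inches: 0.9545

# An exact height is an interval of zero width, so its probability is zero.
exact = heights.cdf(65.003) - heights.cdf(65.003)
print(exact)  # 0.0
```

The “95%” in the rule of thumb is a rounding of roughly 95.45%.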

All well and good, if something follows a normal distribution. But who says everything does? Is it really true that 95% of heights are dispersed within two standard deviations of the mean? What’s amazing is that almost anything can be brought into relation to the normal distribution, at least in terms of calculating averages. If you took a random sample of a population, say a group of Yeshiva students, and measured them, their heights might well not correspond to a normal distribution. However, if you calculated their average height, that average height would likely be close to the average height of the entire Yeshiva population. In fact, if you took many reasonably large sample groups and calculated their averages, those averages would be scattered in an approximately normal distribution around the actual average of the entire population. By measuring a few sample groups, we can estimate with high likelihood the average height of everyone in YC. This fun fact is called the Central Limit Theorem.
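A small simulation makes the theorem less mysterious. What follows is a sketch under assumptions of my own: the “population” is 20,000 made-up numbers drawn from a lopsided (exponential) distribution, which looks nothing like a bell curve, and the helper `sample_means` is a hypothetical name. The averages of repeated samples should still cluster around the population average in roughly bell-shaped fashion.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, rng):
    """Draw n_samples random samples and return the mean of each one."""
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable

# A made-up, heavily skewed population (not real height data).
population = [rng.expovariate(1.0) for _ in range(20_000)]
sample_size = 50
means = sample_means(population, sample_size, n_samples=2_000, rng=rng)

pop_mean = statistics.fmean(population)
# Spread of the sample means predicted by the Central Limit Theorem.
std_error = statistics.pstdev(population) / sample_size ** 0.5

# Share of sample means within two standard errors of the true mean.
within = sum(abs(m - pop_mean) <= 2 * std_error for m in means) / len(means)

print(f"population mean:      {pop_mean:.3f}")
print(f"mean of sample means: {statistics.fmean(means):.3f}")
print(f"share within 2 SE:    {within:.3f}")
```

If the theorem holds, the last figure should land near 0.95, even though the underlying population is far from normal.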

Sorry if the math was a bit dense. But if you found this interesting, you’ll love Wheelan’s chapters on “The Central Limit Theorem”, “Inference” and “Polling”. Perhaps the most important take-away from the book, however, concerns the limitations of statistical studies. The hard part of a statistical study is organizing the experiment, not crunching the numbers. The process is laden with pitfalls. One is getting a representative sample, which isn’t as easy as it might seem. To take an example from the elections, The Atlantic’s polling expert, Andrew McGill, recently discussed the difficulty of using web polls to predict how people will vote. The problem is that many older people, who are the most likely to vote, don’t spend much time on the web[2]. More generally, sometimes the way a poll is conducted gives too much weight to a specific sector of the population. This is known as selection bias, and Wheelan discusses it in “The Importance of Data” along with other issues that can confound statistical conclusions.

The greatest and most systemic source of concern – and this will take us back to psychology’s replication crisis – is the fact that statistics only measure probabilities. If patients in a drug study show “statistically significant” improvement, it means that the improvement is unlikely to be the result of chance alone. But unlikely things do happen, and if enough trials are run, then they are bound to happen. This is where “publication bias” comes in: the fact that only studies with positive results tend to get published. The ones that yield no results don’t, which skews the published studies’ statistical validity. This is what Bloom was talking about when he said psychologists’ method of publication “is biased toward ‘false positives’”. Organizations that fund research try to address this by requiring researchers to submit records of all their trials. But for every study that produces significant results, it’s important to ask: How many came up negative and weren’t published?
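To see why “bound to happen” is not an exaggeration, here is a back-of-the-envelope sketch. It rests on assumptions that are mine rather than Wheelan’s: the trials are independent, the treatment has no real effect at all, and “significant” means the conventional 5% threshold. Under those assumptions, the chance that at least one of n trials comes up significant by luck is 1 - (1 - 0.05)^n.

```python
def chance_of_false_positive(n_trials, alpha=0.05):
    """Probability that at least one of n independent trials of a
    no-effect treatment crosses the significance threshold by chance."""
    return 1 - (1 - alpha) ** n_trials


# Assumed 5% threshold; the trial counts are arbitrary illustrations.
for n in (1, 5, 20, 60):
    print(f"{n:>2} trials: {chance_of_false_positive(n):.3f}")
```

With twenty trials the chance is already about 64%, so if only the lucky trial is published, readers see a “significant” result that means nothing.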

Its dangers notwithstanding, statistical inference remains a powerful tool of the scientific method. (I haven’t even mentioned regression analysis!) New statistical studies are published every day, advancing the front of human knowledge. Or at least purporting to. It’s not a bad time to take a peek under their hood with Naked Statistics.