From Introduction to Computation and Programming Using Python by John Guttag [used in edX's 6.00.1x & 6.00.2x courses, which are excellent].
I'm fascinated by how we can be misled by statistics and how our minds jump to conclusions when we see various numbers and data. Even professionals go astray when interpreting results from studies.
In Chapter 16, Lies, Damned Lies & Statistics, Guttag catalogs the various errors and misinterpretations of statistics.
Some of the more fascinating ones are:
16.2 Pictures Can Be Deceiving
Here, housing prices during the financial crisis are depicted in different bar charts. In one chart, housing prices look stable even though we know prices actually dropped. This was accomplished by using a logarithmic scale on the price axis, which "minimized the amount of space devoted to the area where prices are changing, giving the impression that the changes were relatively small". The price axis ranged from an absurdly low $10,000 all the way up to $1,000,000.
In the other chart, which shows housing prices behaving erratically, a linear scale and a narrow range of prices were used, so the sizes of the changes were exaggerated.
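Here's a minimal matplotlib sketch of the same trick (my own, with invented quarterly prices rather than the book's data): the identical decline can be made to look flat or dramatic purely through axis choices.

```python
# Sketch with invented prices (not the book's data): how axis choices change the visual story.
import matplotlib.pyplot as plt

quarters = list(range(8))
prices = [215_000, 210_000, 195_000, 175_000, 160_000, 155_000, 150_000, 152_000]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Logarithmic scale spanning $10,000..$1,000,000: the ~30% drop looks flat.
ax1.bar(quarters, prices)
ax1.set_yscale('log')
ax1.set_ylim(10_000, 1_000_000)
ax1.set_title('Log scale, wide range: looks stable')

# Linear scale over a narrow range: the same drop looks dramatic.
ax2.bar(quarters, prices)
ax2.set_ylim(140_000, 220_000)
ax2.set_title('Linear scale, narrow range: looks erratic')

plt.show()
```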
In this way, data can be depicted to prove whatever point one wishes to make!
16.3 Cum Hoc Ergo Propter Hoc ['with this, therefore because of this']
This section is about correlation and the mistake of believing that one trend causes another. It also stresses being conscious of lurking variables, or confounders, that may be driving both trends.
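A toy simulation (mine, not the book's, using the classic ice-cream/drowning example) of how a lurking variable can make two unrelated quantities look strongly correlated:

```python
# Toy simulation: a lurking variable (temperature) drives two series that end up
# strongly correlated even though neither causes the other.
import random

random.seed(0)
temps = [random.uniform(0, 35) for _ in range(365)]            # daily temperature (the confounder)
ice_cream = [10 + 3 * t + random.gauss(0, 5) for t in temps]   # sales rise with temperature
drownings = [1 + 0.2 * t + random.gauss(0, 1) for t in temps]  # drownings also rise with temperature

def corr(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(corr(ice_cream, drownings))  # strongly positive, yet neither causes the other
```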
16.4 Statistical Measures Don't Tell the Whole Story
This section is fascinating! It draws on statistician F.J. Anscombe's 1973 paper, showing how the summary statistics of 4 different data sets can be identical, yet the data look completely different!
For example, all the data sets have the same mean of x, same mean of y, same variance of x, same variance of y, and so on. One might see that and conclude that the data sets are basically the same except for trivial differences. Only when you plot them do you realize how different they all are: some are dominated by outliers, which the summary statistics do not capture. Key takeaway: look at your data and don't rely solely on summary statistics!
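A quick check of this (a sketch, not the book's code), using the commonly cited values of Anscombe's quartet; all four data sets print essentially the same summary numbers:

```python
# Anscombe's quartet: four data sets with near-identical summary statistics.
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  (x4,   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (xs, ys) in quartet.items():
    print(f"{name}: mean x = {mean(xs):.2f}, mean y = {mean(ys):.2f}, "
          f"var x = {variance(xs):.2f}, var y = {variance(ys):.2f}")
# All four print essentially the same numbers; only a plot reveals the differences.
```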
16.5 Sampling Bias
This section covers the well-known phenomenon of sampling bias, which comes in several varieties. One example is non-response bias: surveys may not be returned, so your data reflect only the responses of those who decided to participate. Another is convenience (or accidental) sampling, where one chooses only samples that are easy to procure rather than sampling the general population (e.g., college students are easy targets for studies), so the study may not be representative and thus may not generalize. The section ends with an interesting look at a flawed comparison of deaths between heterosexual and homosexual persons.
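A toy simulation (mine, with made-up ages) of how a convenience sample of students can badly misestimate a population quantity:

```python
# Toy simulation of convenience sampling: estimating a population's mean age
# from an easy-to-reach subgroup.
import random

random.seed(0)
population = [random.randint(18, 90) for _ in range(100_000)]  # whole adult population
students = [age for age in population if age <= 22]            # the "convenient" subgroup

random_sample = random.sample(population, 500)
convenience_sample = random.sample(students, 500)

print('True mean age:        ', sum(population) / len(population))
print('Random sample mean:   ', sum(random_sample) / len(random_sample))
print('Convenience sample:   ', sum(convenience_sample) / len(convenience_sample))
# The convenience sample badly underestimates the mean and would not generalize.
```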
16.6 Context Matters
This is really interesting as it challenges what we hear on the news and conventional wisdom. The trouble arises when we view data in isolation (which can lead to sensationalism) rather than in context.
E.g., in 2009 CNN reported that the "swine flu outbreak has caused more than 159 deaths and roughly 2,500 illnesses". However, if you compare that to the roughly 36,000 deaths attributable to the seasonal flu in the U.S. each year, is this really news?
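The context check is one line of arithmetic (using the figures quoted above):

```python
# Back-of-the-envelope context check with the numbers quoted above.
swine_flu_deaths = 159          # reported swine flu deaths, 2009
seasonal_flu_deaths = 36_000    # typical U.S. seasonal-flu deaths per year

print(f"Swine flu deaths were about "
      f"{100 * swine_flu_deaths / seasonal_flu_deaths:.1f}% "
      f"of a typical seasonal-flu year.")  # roughly 0.4%
```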
It also addresses the well-known claim that most auto accidents happen within 10 miles of home, as well as the NRA's claim that "roughly 99.8% of firearms in the U.S. will not be used to commit a violent crime in any given year".
16.8 The Texas Sharpshooter Fallacy
This is about seeing things where there may be nothing to see: looking at data and 'seeing a trend' that most likely arose from blind chance. One way to combat this is to start with a hypothesis and then test that hypothesis. An example involving anorexic women is presented in this section. What probably happens in practice is that the hypothesis tests the researchers set out to run fail, leaving them with a tremendous amount of data, which they then mine for other patterns. This can lead to misinterpreting chance results as actual trends.
It's similar to what I learned in the MedStats edX class from Stanford taught by Kristin Sainani: if you torture your data long enough, it will confess. I.e., if you run enough hypothesis tests across your data, chances are you will find something 'significant'. However, it is likely that it arose solely from chance.
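A sketch of that confession under torture (my own example; assumes SciPy is installed): run many t-tests on pure noise and count how many come out 'significant' at the usual 0.05 level.

```python
# Multiple testing on pure noise: some tests will look "significant" by chance alone.
import random
from scipy import stats   # assumes SciPy is available

random.seed(0)
num_tests = 100
false_positives = 0
for _ in range(num_tests):
    group_a = [random.gauss(0, 1) for _ in range(30)]   # noise, no real effect
    group_b = [random.gauss(0, 1) for _ in range(30)]   # noise, no real effect
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"{false_positives} of {num_tests} tests were 'significant' by chance alone")
# Expect roughly 5 -- about one comparison in twenty confesses under torture.
```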
16.9 Percentages Can Confuse
This is definitely a major thing to be aware of. Because percentage rises and falls are computed relative to the value before the change, they can be quite misleading. There is a nice example of a financial advisor talking to a client about his portfolio, but the client interprets the statement quite differently.
This ties into MedStats, where the professor talks about relative risk and absolute risk. When the baseline the percentage is computed from is tiny, the relative risk can sound gigantic. E.g., if the incidence of a disease is 1 in 1,000,000 and taking the drug doubles it (a relative risk of 200%), it's now 2 in 1,000,000. "Doubles the risk" sure sounds like a lot, but is 2 in 1,000,000 really something to worry about? Absolute risk is the better indicator here: 2/1,000,000 - 1/1,000,000 = 1/1,000,000, a 0.0001% increase in risk. I won't lose sleep over that.
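Spelling out that arithmetic in a few lines of Python (same made-up numbers as above):

```python
# Relative vs. absolute risk increase for a rare disease (illustrative numbers).
baseline_risk = 1 / 1_000_000      # incidence without the drug
new_risk = 2 / 1_000_000           # incidence with the drug (doubled)

relative_increase = (new_risk - baseline_risk) / baseline_risk * 100   # 100% increase (risk doubled)
absolute_increase = (new_risk - baseline_risk) * 100                   # expressed as a percentage

print(f"Relative increase: {relative_increase:.0f}%")   # sounds scary
print(f"Absolute increase: {absolute_increase:.4f}%")   # 0.0001% -- nothing to lose sleep over
```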