Friday, November 14, 2014

Lies & Statistics

From Introduction to Computation & Programming Using Python by John Guttag [used in EdX's 6.00.1x & 6.00.2x courses, which are excellent.]

I'm fascinated by how we can be misled with statistics and how our minds come to conclusions when we see various numbers and data. Even professionals go astray when interpreting results from studies.

In Chapter 16, Lies, Damned Lies & Statistics, Guttag catalogs the various errors and misinterpretations of statistics.

Some of the more fascinating ones are:

16.2 Pictures can be Deceiving
Here, housing prices during the financial crisis are depicted in various bar charts. In one of the charts, the housing prices look stable, even though we know that prices actually did drop. This was accomplished by using a logarithmic scale on the prices, which "minimized the amount of space devoted to the area where prices are changing, giving the impression that the changes were relatively small." The chart ranged from an absurdly low price of $10,000 all the way to $1,000,000.

The other chart, which shows housing prices behaving erratically, used a linear scale and a narrow range of prices, so the sizes of the changes were exaggerated.

In this way, data can be depicted to prove whatever point one wishes to make!
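A quick matplotlib sketch (with made-up prices, not the book's actual figures) of how the same series can be made to look flat or dramatic just by changing the scale and the y-axis range:

# Illustrative only: made-up quarterly housing prices, not the book's data.
import matplotlib.pyplot as plt

quarters = list(range(8))
prices = [220_000, 215_000, 205_000, 190_000, 175_000, 165_000, 160_000, 158_000]

fig, (ax_log, ax_lin) = plt.subplots(1, 2, figsize=(10, 4))

# Chart 1: log scale with an absurdly wide range -- the ~30% drop looks flat.
ax_log.bar(quarters, prices)
ax_log.set_yscale('log')
ax_log.set_ylim(10_000, 1_000_000)
ax_log.set_title('Log scale, 10,000 to 1,000,000 range: looks stable')

# Chart 2: linear scale with a narrow range -- the same drop looks dramatic.
ax_lin.bar(quarters, prices)
ax_lin.set_ylim(150_000, 230_000)
ax_lin.set_title('Linear scale, narrow range: looks like a collapse')

plt.tight_layout()
plt.show()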

16.3 Cum Hoc Ergo Propter Hoc ['with this, therefore because of this']
This section is about correlation and the mistake of believing that one trend causes another. It also stresses being conscious of lurking variables, or confounders.

16.4 Statistical Measures Don't Tell the Whole Story
This section is fascinating! It presents statistician F. J. Anscombe's 1973 paper, which shows how the summary statistics of 4 different data sets can be identical even though the data sets themselves look completely different!

For example, all the data sets have the same mean x values, same mean y values, same variance for x, same variance for y, and so on. One might see that and conclude that all the data sets are basically the same except for trivial differences. However, if you plot the different data sets, only then will you realize how different they all are. For example, some are completely skewed because of outliers, but the summary data would not capture that. Key takeaway: Look at your data and don't rely completely on summary statistics!
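A quick way to see this for yourself, using the quartet values as usually published and checking the summary statistics:

# Anscombe's quartet: identical summary statistics, wildly different data.
import numpy as np

quartet = {
    "I":   ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": ([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.array(x), np.array(y)
    print(f"{name}: mean(x)={x.mean():.2f}  mean(y)={y.mean():.2f}  "
          f"var(x)={x.var(ddof=1):.2f}  var(y)={y.var(ddof=1):.2f}  "
          f"corr={np.corrcoef(x, y)[0, 1]:.3f}")
# All four print essentially the same numbers -- only a plot reveals
# the line, the curve, the outlier, and the vertical stack.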

16.5 Sampling Bias
This section talks about the well-known phenomenon of bias. There are different types of bias. One example is non-response bias, where surveys may not be returned, so your data only reflects the responses of those who decided to participate. Another is convenience (or accidental) sampling, where one might choose only samples that are easy to procure rather than sampling the general population (e.g., college students are easy targets for studies), so the study may not be representative and thus not generalize. The section ends with an interesting look at a flawed comparison of deaths between heterosexual and homosexual persons.
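A toy simulation (made-up numbers, not from the book) of how a convenience sample can miss the true population value:

# Estimate average hours of sleep, but only sample people who are easy to reach.
import random

random.seed(0)

# Hypothetical population: 90% general public (mean ~7.5h), 10% college
# students (mean ~6.0h) who are far easier to recruit for a study.
population = ([random.gauss(7.5, 1.0) for _ in range(9000)] +
              [random.gauss(6.0, 1.0) for _ in range(1000)])
students = population[9000:]

true_mean = sum(population) / len(population)
convenience = random.sample(students, 200)          # only the easy-to-reach group
representative = random.sample(population, 200)     # a proper random sample

print(f"true population mean:       {true_mean:.2f}")
print(f"convenience sample mean:    {sum(convenience) / len(convenience):.2f}")
print(f"representative sample mean: {sum(representative) / len(representative):.2f}")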

16.6 Context Matters
This is really interesting, as it challenges what we hear on the news and conventional wisdom. The trouble arises when we view data in isolation (which can lead to sensationalism) as opposed to viewing it in context.

E.g., in 2009, CNN reported that the "swine flu outbreak has caused more than 159 deaths and roughly 2,500 illnesses". However, if you compare that to the 36,000 deaths attributable annually to the seasonal flu in the U.S., is this really news?

Also, it addresses the well-known fact that "most auto accidents happen within 10 miles of home," as well as the NRA's claim that "roughly 99.8% of firearms in the U.S. will not be used to commit a violent crime in any given year."


16.8 The Texas Sharpshooter Fallacy
This is about seeing things where there might not be anything to see; this can arise from looking at data and 'seeing a trend' that most likely arose from blind chance. One way to combat this is to start with a hypothesis and then test that hypothesis. An example involving anorexic women is presented in this section. What probably happens in practice is that the hypothesis tests the researchers wanted to focus on fail, and the researchers are left with a tremendous amount of data which they then mine in order to find other patterns. This can lead to misinterpreting chance results as actual trends.

It's similar to what I learned in the MedStats EdX class from Stanford taught by Kristin Sainani: if you torture your data long enough, it will confess. That is, if you run multiple hypothesis tests across your data, chances are you will find something significant. However, it is likely to have arisen solely from chance.
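A small simulation of exactly that: run a bunch of t-tests on pure noise, where every null hypothesis is true, and see how many come out 'significant' anyway:

# "Torturing the data": many hypothesis tests on noise still yield hits.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 100

# Two groups drawn from the SAME distribution, so every null hypothesis is true.
p_values = [stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
            for _ in range(n_tests)]

false_positives = sum(p < 0.05 for p in p_values)
print(f"{false_positives} of {n_tests} tests were 'significant' by chance alone")
# Expect roughly 5 -- those are the Texas sharpshooter's bullseyes.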

16.9 Percentages Can Confuse
This is definitely a major thing to be aware of. Because percentage rises and falls are calculated relative to the original value before the rise/fall, they can be quite misleading. There is a nice example of a financial advisor talking to a client about his portfolio, but the client interprets the statement very differently from what actually happened.
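A tiny numeric check of why the base matters (not the book's exact advisor/client example):

# Percentage changes are relative to the value *before* the change,
# so a 50% loss followed by a 50% gain leaves you down 25% overall.
portfolio = 100_000.0
portfolio *= 0.50   # falls 50%  -> 50,000
portfolio *= 1.50   # rises 50%  -> 75,000
print(portfolio)    # 75000.0, not the original 100,000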

This ties into MedStats, where the professor talks about relative risk and absolute risk. The relative risk can sound gigantic when the baseline of the percentage is small. E.g., if the incidence of a disease is 1 in 1,000,000 and taking the drug raises it to 200% of that (i.e., it doubles), that means it's now 2 in 1,000,000. The 200% sure sounds like a lot, but is 2 in 1,000,000 really something to worry about? Using absolute risk here is a better indicator. The absolute risk increase is 2/1,000,000 - 1/1,000,000 = 1/1,000,000 = 0.0001%; doesn't seem like I'll lose sleep over that increase in risk.
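The same arithmetic in a few lines of Python:

# Relative vs. absolute risk for the 1-in-a-million example above.
baseline_risk = 1 / 1_000_000
drug_risk     = 2 / 1_000_000

relative = drug_risk / baseline_risk            # 2.0 -> "200% of baseline", i.e., it doubles
absolute_increase = drug_risk - baseline_risk   # 1 / 1,000,000

print(f"relative risk:          {relative:.1f}x")
print(f"absolute risk increase: {absolute_increase:.7f} ({absolute_increase:.4%})")
# 0.0001% extra risk -- the scary-sounding percentage hides a tiny absolute change.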

Friday, November 7, 2014

The Normal Distribution

The normal distribution is the distribution that results from summing up a large number of chance events. More precisely, the binomial distribution describes the number of successes in n independent chance events, each happening with probability p. It is a discrete distribution, but for large numbers of events it very much resembles a normal distribution, so the normal distribution is an excellent approximation to the binomial distribution and can also be used for continuous variables.

They say errors are distributed normally. What does that mean?

The binomial distribution is a pain to calculate [and indeed was difficult before computers], so the normal distribution, which is much easier to work with [even though its formula looks a bit horrendous], is used instead because that formula has nice mathematical properties.
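A quick comparison (my own example, using scipy) of the exact binomial calculation versus the normal approximation for 1,000 coin flips:

# 1000 coin flips with p = 0.5: compare P(X <= 520) both ways.
import math
from scipy import stats

n, p = 1000, 0.5
mu = n * p
sigma = math.sqrt(n * p * (1 - p))

exact  = stats.binom.cdf(520, n, p)                  # exact binomial
approx = stats.norm.cdf(520.5, loc=mu, scale=sigma)  # normal approximation
                                                     # (with continuity correction)
print(f"binomial: {exact:.4f}   normal approx: {approx:.4f}")
# Both come out around 0.90 -- the approximation is excellent for large n.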

The normal distribution also occurs in nature, i.e., just by measuring phenomena and plotting them (e.g. levels of insulin in men), you will more than likely see your data normally distributed.

And the normal distribution also describes the variation in data were we to repeat an experiment many times.

The normal distribution also allows one to build something called a confidence interval: a range that is likely to contain the unknown value being estimated, together with a degree of confidence that the value lies within that range (e.g., the candidate will get 52% of the vote +/- 4%, so the estimate is 52% and the interval has width 8). The calculation of a confidence interval assumes that the distribution of estimation errors is normal with a mean of 0. [So if I were to poll x people y times, I wouldn't expect to get exactly the same result each time; instead the results would vary slightly from poll to poll. That variation is the estimation errors being distributed normally.]
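A rough sketch of the polling example; the sample size n = 600 is my own assumption, chosen so the margin of error comes out near +/- 4%:

# 95% confidence interval for an estimated proportion of 52%.
import math

n = 600          # hypothetical number of people polled
p_hat = 0.52     # estimated share of the vote

std_err = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of the estimate
margin = 1.96 * std_err                        # ~95% of a normal lies within 1.96 SD

print(f"estimate: {p_hat:.0%}  margin: +/- {margin:.1%}  "
      f"interval: ({p_hat - margin:.1%}, {p_hat + margin:.1%})")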

Question on memoryless property of exponential or geometric distributions:
Rolling a fair die has a uniform distribution of outcomes: rolling a 1, a 6, or any number in between has the same probability, so this is uniform. However, the number of rolls until I get a 6 is memoryless, is it not? Because if I haven't rolled a 6 in the first t rolls, then the probability of getting a 6 within the next s rolls (i.e., by roll t + s) should equal the probability of getting a 6 within the first s rolls from the beginning, yes?
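A simulation to check that intuition; the specific values t = 3 and s = 4 are arbitrary:

# For the number of rolls T until the first 6, is P(T > t + s | T > t) == P(T > s)?
import random

random.seed(1)

def rolls_until_six():
    count = 0
    while True:
        count += 1
        if random.randint(1, 6) == 6:
            return count

trials = [rolls_until_six() for _ in range(200_000)]
t, s = 3, 4

p_more_than_s = sum(T > s for T in trials) / len(trials)
survivors = [T for T in trials if T > t]          # condition on "no 6 in first t rolls"
p_conditional = sum(T > t + s for T in survivors) / len(survivors)

print(f"P(T > {s})             ~ {p_more_than_s:.3f}")
print(f"P(T > {t+s} | T > {t}) ~ {p_conditional:.3f}")
# Both come out near (5/6)**4 ~ 0.482, so the geometric distribution
# of rolls-until-a-6 is indeed memoryless.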


On Means and Medians

Source: Vickers, "What is a p-value anyway?"
Source: MedStats Stanford Class
Source: Introduction to Computation & Programming Using Python [from MIT's 6.00.2x on EdX]

Chapter on Mis-using Means in Vickers Book:

Average of 2.6 children per British woman survey example:

It sounds silly to say that on average a British woman has 2.6 children. How can you have 2.6 children? It doesn't literally make sense.

  1. The usage of "average" here is the mean, arrived at by, let's say, 1,000,000 women having 2,600,000 children between them. This gives an arithmetic mean of 2.6. Another way you can express an average is the median, which you get by lining up all the #s of children in sorted order and picking the one in the middle. Most likely the median for this survey was 2. The main difference b/w the mean and the median is that the mean is an artificial abstraction [i.e., a mathematical calculation] while the median is something you can actually go out and see.

Why even use these artificial #s? The usage of 'artificial' #s like the mean & standard deviation is actually convenient for answering lots of questions quickly. For roughly normally distributed data, you can say 95% of observations are within 2 standard deviations of the mean, and 2/3 of observations are within one standard deviation of the mean, and so on. You can calculate this without consulting the real data, which would be more time consuming. You can also do hypothesis testing with these artificial #s.
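A quick check of those rules of thumb on simulated, normally distributed data:

# How much of a normal sample falls within 1 and 2 standard deviations of the mean?
import random

random.seed(2)
data = [random.gauss(mu=0, sigma=1) for _ in range(100_000)]
mean = sum(data) / len(data)
sd = (sum((x - mean) ** 2 for x in data) / (len(data) - 1)) ** 0.5

within_1 = sum(abs(x - mean) <= 1 * sd for x in data) / len(data)
within_2 = sum(abs(x - mean) <= 2 * sd for x in data) / len(data)
print(f"within 1 SD: {within_1:.1%}   within 2 SD: {within_2:.1%}")
# ~68% and ~95% -- but only because this data really is normal.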

So these artificial #s can be calculated but the important and relevant thing to ask is: What is it being used for? i.e., what's the purpose of this calculation?

Just saying that the average British woman has 2.6 children and leaving it at that seems silly. You want to do one of two things with statistics:
  1. Estimation: as a descriptive statistic [means should be reported along with the standard deviation, but in this example it wasn't!]
  2. Inference: hypothesis testing; testing a theory that one group has a higher something-or-other than a control group

Moreover, with 2.6 children you cannot really test a hypothesis as such, because many hypothesis tests can only be used on continuous variables, whereas # of children is a discrete count variable [only integers].

As said above, with estimation, reporting an average of 2.6 children per British woman without a standard deviation leaves out a lot of information. For example, most women might have either 2 or 3 children while a few women have lots and lots. [Outliers can skew the mean.] So a single number can rarely be used to describe data; measures of spread should be reported alongside estimates!!
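A small made-up example of how a few outliers drag the mean around while the median stays put:

# Most women have 2 or 3 children, a few have lots and lots (hypothetical counts).
from statistics import mean, median

children = [2] * 45 + [3] * 45 + [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

print(f"mean:   {mean(children):.2f}")    # dragged upward by the outliers
print(f"median: {median(children):.2f}")  # still reflects the typical family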

A similar study showed British women having an average of 2.1 children with a standard deviation of 1.1. Applying the rule that 95% of data are within 2 standard deviations of the mean, you get that 95% of families have between -0.1 and 4.3 children, which is really ridiculous! So using the mean/standard deviation is not very good here. What's more apt in a case like this is a histogram.
The table's columns are: # of children in family, percentage, and cumulative percentage. It's a basic table that breaks down the # of children by the percentage of families that have that # of children. You no longer have a single number, but it's much more useful and accurate.
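A sketch of building that kind of table in Python, with hypothetical counts (not the real survey data):

# Number of children per family -> percentage -> cumulative percentage.
from collections import Counter

families = [0]*18 + [1]*20 + [2]*35 + [3]*17 + [4]*7 + [5]*2 + [8]*1   # hypothetical responses

counts = Counter(families)
total = len(families)
cumulative = 0.0

print(f"{'# children':>10} {'percent':>9} {'cumulative':>11}")
for n_children in sorted(counts):
    pct = 100 * counts[n_children] / total
    cumulative += pct
    print(f"{n_children:>10} {pct:>8.1f}% {cumulative:>10.1f}%")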

Questions to ask when confronted with such #s:

  1. Is this a continuous or categorical variable?
  2. What is the point of using the statistic? Is it for estimation/description, or for hypothesis testing?
  3. If there's a mean, what is the standard deviation? Given the mean, consider whether outliers could really be skewing it. Is that possible in this case?
  4. Does 95% of the data fall within 2 standard deviations of the mean? Is that even plausible, or is it nonsensical?
  5. How do the median and interquartile range look?
  6. As a last resort, plot a histogram of the data along with cumulative percentages.