Source: Vickers "So what's a P value anyway?"
Source: MedStats Stanford Class
Source: Introduction to Computation and Programming Using Python [from MIT's 6.02 on edX]
Chapter on Mis-using Means in Vickers Book:
Average of 2.6 children per British woman survey example:
It sounds silly to say that, on average, a British woman has 2.6 children. How can you have 2.6 children? It doesn't literally make sense.
- The average here is the arithmetic mean. It is arrived at like this: if 1,000,000 women have 2,600,000 children in total, the mean is 2,600,000 / 1,000,000 = 2.6. Another way to express an average is the median: line up all the numbers of children in sorted order and pick the one in the middle. Most likely the median for this survey was 2. The main difference between the mean and the median is that the mean is an artificial abstraction [i.e., a mathematical calculation], while the median is something you can actually go out and see.
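To see the difference between the two averages, here is a small sketch using Python's standard `statistics` module. The list of family sizes is made up for illustration (it is not the survey data); it was chosen so the mean comes out to 2.6.

```python
from statistics import mean, median

# Hypothetical family sizes, invented for illustration (not survey data).
children = [0, 1, 2, 2, 2, 2, 3, 3, 4, 7]

avg = mean(children)    # total children / number of women = 26 / 10
mid = median(children)  # middle of the sorted list

print(avg)  # 2.6
print(mid)  # 2.0
```

Nobody in the list has 2.6 children, but someone really does have 2, which is the sense in which the median is something you can go out and see.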
Why even use these artificial numbers? 'Artificial' numbers like the mean and standard deviation are convenient for answering lots of questions quickly. If the data are roughly normally distributed, you can say that about 95% of observations are within 2 standard deviations of the mean, about 2/3 of observations are within one standard deviation of the mean, and so on. You can calculate this without consulting the raw data, which would be more time consuming. You can also do hypothesis testing with these artificial numbers.
So these artificial numbers can be calculated, but the important and relevant thing to ask is: What are they being used for? i.e., what's the purpose of this calculation?
Just saying that the average British woman has 2.6 children and leaving it at that seems silly. You want to do one of two things with statistics:
- Estimation: as a descriptive statistic [means should be reported along with the standard deviation, but in this example it wasn't!]
- Inference: hypothesis testing; test a theory that one group has a higher value of something than a control group
Moreover, with 2.6 children, you cannot test a hypothesis as such, because many hypothesis tests can only be used on continuous variables, whereas the number of children is a discrete variable [only integers] that behaves more like a categorical one.
As said above, with estimation, reporting an average of 2.6 children per British woman without a standard deviation leaves out a lot of information. For example, most women might have either 2 or 3 children while a few women have lots and lots [outliers can skew the mean]. So a single number can rarely describe data; measures of spread should be reported alongside estimates!
A similar study showed British women having an average of 2.1 children with a standard deviation of 1.1. Applying the rule that 95% of data are within 2 standard deviations of the mean gives 2.1 ± 2 × 1.1, i.e., 95% of families have between -0.1 and 4.3 children, which is really ridiculous! So using the mean and standard deviation is not very good here. What's more apt in a case like this is a histogram.
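The arithmetic behind that range is simple enough to check in a few lines. This is only a sketch: the mean and standard deviation are the ones quoted above, and the variable names are my own.

```python
# Values quoted in the notes above.
mean_children = 2.1
sd_children = 1.1

# "95% within 2 standard deviations" gives mean - 2*SD to mean + 2*SD.
low = mean_children - 2 * sd_children
high = mean_children + 2 * sd_children

print(f"{low:.1f} to {high:.1f}")  # -0.1 to 4.3

# A count of children cannot be negative, so the lower bound is a red flag.
print(low < 0)  # True
```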
# of children in family vs. percentage vs. cumulative percentage
It's a basic table that breaks down the number of children by the percentage of families that have that many children. You no longer have a single number, but it's much more useful and accurate.
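A table like that is easy to build. Below is a minimal sketch using `collections.Counter` from the standard library. The data are hypothetical (invented for illustration, not the survey's numbers), and `frequency_table` is a helper name I made up.

```python
from collections import Counter

# Hypothetical family sizes, invented for illustration (not survey data).
children = [0, 1, 2, 2, 2, 2, 3, 3, 4, 7]

def frequency_table(values):
    """Return rows of (value, percentage, cumulative percentage)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running = 0  # running count, so the cumulative % is computed exactly
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, 100 * counts[value] / total, 100 * running / total))
    return rows

table = frequency_table(children)
print(f"{'children':>8} {'percent':>8} {'cumulative':>10}")
for value, pct, cum_pct in table:
    print(f"{value:>8} {pct:>8.1f} {cum_pct:>10.1f}")
```

With this made-up data, the table shows directly that most families have 2 or 3 children and that one large family sits far out on its own, which the single number 2.6 hides.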
Questions to ask when confronted with such #s:
- Is this a continuous or categorical variable?
- What is the point of using the statistic? Is it for estimation/description or for hypothesis testing?
- If there's a mean, what is the standard deviation? Given only the mean, consider whether outliers could be skewing it. Is that possible in this case?
- Do 95% of the data fall within 2 standard deviations of the mean? Is that range even possible, or is it nonsensical?
- How do the median and interquartile range look?
- As a last resort, plot a histogram of the data, with cumulative percentages
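Several of these checks can be run in one go when you do have the raw data. The following is a rough sketch, not a definitive recipe: the data are hypothetical, `summarize` and `minimum_possible` are names I invented, and `statistics.quantiles` needs Python 3.8 or later (its default quartile method is only one of several conventions).

```python
from statistics import mean, quantiles, stdev

# Hypothetical family sizes, invented for illustration (not survey data).
children = [0, 1, 2, 2, 2, 2, 3, 3, 4, 7]

def summarize(values, minimum_possible=0):
    """Collect the summary numbers the checklist asks about."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation
    q1, med, q3 = quantiles(values, n=4)  # quartiles; the middle one is the median
    low, high = avg - 2 * sd, avg + 2 * sd
    return {
        "mean": avg,
        "sd": sd,
        "median": med,
        "iqr": (q1, q3),
        "two_sd_range": (low, high),
        # False when mean - 2*SD drops below a value the variable cannot take.
        "range_is_plausible": low >= minimum_possible,
    }

summary = summarize(children)
for name, value in summary.items():
    print(name, value)
```

If `range_is_plausible` comes back `False`, as it does for this made-up list, that is a hint to report the median and interquartile range, or the full frequency table, instead of the mean and standard deviation.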