Next up from O’Reilly’s Statistics Hacks: “Describe the World Using Just Two Numbers,” an explanation of something called the Central Limit Theorem.
Amongst all of the chi-squares and alphas, one of the few concepts I know very well is the mean … the average of a set of numbers, derived by adding up all the values and dividing by the number of data points in the sample. Easy. What I didn’t know, though, is that there is something very interesting about that number. A mean is as close as a single number can be to all of the values in the set. That is, if you added up the squared distances between the mean and each value, that sum total would be smaller than if you did the same with any other value. (You have to square the distances because of the negative numbers; the plain differences cancel each other out.) And if you take that sum total for the mean and divide it by the number of data points, you get roughly the square of the standard deviation. (The actual formula for S.D. is the square root of the sum of the squares of each distance over the number of scores minus 1.)
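To make that concrete, here is a minimal sketch in Python. The list of scores is invented, purely for illustration:

```python
# A made-up sample; the values are arbitrary.
scores = [12, 15, 9, 20, 14, 11, 17]

n = len(scores)
mean = sum(scores) / n

# Sum of squared distances from the mean. Squaring keeps the
# negative distances from canceling out the positive ones.
squared_distances = sum((x - mean) ** 2 for x in scores)

# Sample standard deviation: divide by (n - 1), then take the root.
std_dev = (squared_distances / (n - 1)) ** 0.5

print(f"mean = {mean:.2f}, standard deviation = {std_dev:.2f}")
```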
The Central Limit Theorem states that “if you randomly select multiple samples from a population, the means of each of those samples will be normally distributed.” (Frey) That translates to being able to use the standard deviation and mean, given a sample size of at least N = 30, to project what the entire population is like. The larger the sample size, the more accurate the estimate, but even one sample of adequate size can be a good one. Dead math people told us so. What is ultimately calculated is the standard error of the mean, or the degree to which the sample mean would stray from the population mean.
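The theorem is easy to watch in action. Here is a rough simulation sketch; the population, sample size, and number of samples are all assumptions I picked for illustration:

```python
import random

random.seed(42)

# A decidedly non-normal population: values spread flat from 0 to 100.
population = [random.uniform(0, 100) for _ in range(100_000)]

SAMPLE_SIZE = 30   # the N = 30 rule of thumb from the hack
NUM_SAMPLES = 1_000

# Draw many samples and record each sample's mean.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = random.sample(population, SAMPLE_SIZE)
    sample_means.append(sum(sample) / SAMPLE_SIZE)

# The sample means pile up around the population mean, in a roughly
# normal (bell-shaped) heap, even though the population itself is flat.
pop_mean = sum(population) / len(population)
mean_of_means = sum(sample_means) / len(sample_means)
print(f"population mean      = {pop_mean:.2f}")
print(f"mean of sample means = {mean_of_means:.2f}")
```

Histogram `sample_means` and you get a bell curve; histogram `population` and you get a flat slab.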
This test can be used to check whether a sample was drawn at random. If the mean of the sample is not within the standard error of the actual mean, then the sample wasn’t drawn by chance. Being within the standard error is an indication that the sample was affected by “lots of random forces and unrelated events,” and thus has a normal distribution. This is true whether or not the population itself is normal.
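Here is a small sketch of that check, assuming one sample and a known population mean. The numbers are invented, and, as a reply below points out, a real test would attach a confidence level rather than a hard one-standard-error cutoff:

```python
import statistics

# An invented sample of 30 scores and an assumed known population mean.
sample = [52, 47, 61, 49, 55, 58, 44, 50, 53, 57,
          48, 51, 60, 46, 54, 59, 45, 56, 49, 52,
          50, 55, 47, 53, 58, 51, 49, 54, 46, 57]
population_mean = 50

n = len(sample)
sample_mean = statistics.mean(sample)

# Standard error of the mean: sample S.D. over the square root of n.
standard_error = statistics.stdev(sample) / n ** 0.5

if abs(sample_mean - population_mean) <= standard_error:
    print("Within one standard error: consistent with a chance draw.")
else:
    print("Outside one standard error: maybe not a random sample.")
```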
Some definitions (all of which get computed in the short sketch after the list):
- Descriptive statistics: the properties of the sample scores
- Inferential statistics: the properties of the entire population, based on what is known about the sample scores
- Parametric statistics: the use of two sample values and an assumption about the shape of the distribution across a population to accurately describe that population
- Central tendency: a fair summary representation of all scores in a sample
- Mean: the arithmetic average of all scores in a sample (total value / total data points), often the best measure of central tendency
- Variability: a representation of how far from center most of the data falls
- Standard deviation: the average distance from each score to the mean, often the best measure of variability in a data set because it uses all values in the distribution
- Variance: the square of the standard deviation, most useful in comparing different distributions rather than describing a single distribution
- Standard error of the mean: the degree to which the sample mean would stray from the population mean, calculated as the standard deviation of the sample divided by the square root of the sample size
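To tie the definitions together, here is a short sketch that computes each of these quantities for one invented sample, using Python’s statistics module:

```python
import statistics

scores = [88, 92, 79, 85, 90, 84, 95, 81, 87, 83]  # made-up sample
n = len(scores)

mean = statistics.mean(scores)          # best measure of central tendency
std_dev = statistics.stdev(scores)      # standard deviation (n - 1 form)
variance = statistics.variance(scores)  # the square of the standard deviation
standard_error = std_dev / n ** 0.5     # S.D. over the square root of n

print(f"mean               = {mean:.2f}")
print(f"standard deviation = {std_dev:.2f}")
print(f"variance           = {variance:.2f}")
print(f"standard error     = {standard_error:.2f}")
```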
2 replies on “Stats Hack #2”
I’m enjoying your entries on stats… I’m certainly learning new things. A couple of questions though…
“If the mean of the sample is not within the standard error of the actual mean, then the sample wasn’t drawn by chance.”
Do you need a “95 times out of 100” in there somewhere?
“Being within the standard error is an indication that the sample was affected by ‘lots of random forces and unrelated events,’ and thus has a normal distribution”
This doesn’t make sense to me. I understand why the means are normally distributed, but the samples themselves should generally be distributed like the population. Unless when you say “sample” here, you mean the sample of means. Which seems right, but makes my head explode a little. 🙂
My head is exploding, too. These supposedly simple concepts are the building blocks for everything else, and I find myself needing to really concentrate to make sure they stick.
What has stuck for me, thus far, is that the first thing you do is look at the mean and standard deviation to get a sense of what kind of thing you are looking at. Then, presumably, all sorts of number-crunching options await. But these basic measures are the things that best help to distinguish one data set from another.
This book, a quirky read that includes a few references to the reader as “dude,” doesn’t go into lots of details about some of this stuff. I anticipate that will come, as there are a number of references to later hacks. So for now, I have to try to absorb what I can and take a few things on faith.
One of those things is that a sample with a mean within the standard error is considered random; a sample with a mean outside of the standard error is not. Even that term, random, seems like it now needs some redefining in my mind. So, if the people involved in the sample are truly reflective of what, statistically, is a reasonable cross-section of the full population, then the sample’s mean should always be close to the full population’s mean. If it isn’t, some wackiness is ensuing, and it is no longer a random event.