Another bit of wisdom from Statistics Hacks to share … Hack #6: Measure Precisely.
The results of any experimental test are likely to be criticized first in two areas: is the test reliable, and is it valid? This means not only that the test should be replicable again and again with similar results, but also that it has to be shown to be a good measure of whatever one is trying to study.
The techniques for making these assessments are derived from classical test theory. This theory presumes there is a true score existing in the ether, and that the difference between it and the observed score is telling. With classical test theory, the reliability of a score can be calculated: a value between 0 and 1, where a higher number indicates greater reliability. With that reliability value, the standard error of measurement can be determined. The standard error of measurement is the product of the standard deviation of a group of observed scores and the square root of one minus the reliability.
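In code, that relationship is a one-liner. Here is a minimal sketch (the function name and the sample numbers are mine for illustration, not from the book):

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability), per classical test theory."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical example: scores with a standard deviation of 100 on a
# test with reliability 0.91 carry a standard error of about 30 points.
print(round(standard_error_of_measurement(100.0, 0.91), 1))  # 30.0
```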
So test scores with high reliability will tend toward smaller errors of measurement. And as the range of observed scores grows, so do the standard deviation and thus the standard error. It is generally accepted that 68% of the time the observed score will fall within one standard error of measurement of the true score. If we widen that range to two standard errors of measurement (more precisely, 1.96), then the observed score will fall within that range 95% of the time. This level of confidence assumes that errors are random and therefore follow a normal distribution.
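The same idea as a small helper that turns an observed score into a confidence band (again, a sketch; the name and default are mine):

```python
import math

def confidence_band(observed: float, sd: float, reliability: float,
                    z: float = 1.96) -> tuple[float, float]:
    """Range of +/- z standard errors of measurement around an observed score.

    z = 1.0 covers roughly 68% of cases and z = 1.96 covers 95%,
    assuming errors are random and normally distributed.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return observed - z * sem, observed + z * sem
```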
Why do all this? The reason is that knowing where a score falls within a given confidence range lets one know whether retaking the test is likely to improve one’s score. It’s great for GREs, when you are debating whether to retake the test to improve on that 540 verbal.
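Plugging in that 540 with made-up test statistics (a standard deviation of 100 and a reliability of 0.91 are assumptions for illustration, not the GRE's published figures):

```python
import math

# Assumed, for illustration only: SD = 100, reliability = 0.91 -> SEM of ~30.
sem = 100.0 * math.sqrt(1.0 - 0.91)
low, high = 540 - 1.96 * sem, 540 + 1.96 * sem
print(f"95% band around 540: {low:.0f} to {high:.0f}")  # 481 to 599
```

A retake that lands anywhere in that band is consistent with the same underlying ability, so a modest bump on a second sitting proves little.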
Some definitions:
- Validity: the degree to which a score represents the trait one wishes to measure, demonstrated through evidence and supporting theory
- Reliability: the degree to which a result can be duplicated with predictable results, typically presented as a number between 0 and 1, with 1 indicating no random error
- Classical test theory: sometimes called reliability theory, this equates an observed score with the sum of the true score and any error
- Observed score: the actual score reported for a given test
- True score: the score that would have been observed had no random factors interfered, or the average of an infinite number of scores for a given situation
- Error score: the difference between the observed score and the true score
- Standard error of measurement: the average distance from each person's true score to his observed score, used to calculate the range of scores within which a true level of performance resides