This unit explains different measures of variance. Measures of variance refer to how spread out a dataset is. Measures of variance include:
- The range
- Interquartile range
- Standard deviation
- Variance (this one seems obvious!)
After you complete this page and the quizzes on it, you should have a pretty solid foundation for understanding measures of variance.
If you think you may already be a pro at this, just skip the explanations and go right to the quizzes. If you pass all the quizzes, you may be ready to move on!
Range = The biggest number - the smallest ("Max-Min")
Pretty easy... Why don't you have a go at it:
What is the range of the following dataset?
2,3,3,3,4,5,5,6,7,9,10
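If you would like to check your answer with a computer, here is a minimal Python sketch of the "Max-Min" rule. The function name `value_range` is just an illustrative choice, not something from a stats package.

```python
# The dataset from the question above.
data = [2, 3, 3, 3, 4, 5, 5, 6, 7, 9, 10]


def value_range(values):
    """Return the biggest number minus the smallest ("Max-Min")."""
    return max(values) - min(values)


# Work it out by hand first, then run this to compare.
print(value_range(data))
```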
Notably, the "min" and "max" lines on a boxplot exclude outliers, so they may not match up with the very smallest and very largest numbers. The cutoff differs between software packages, but a common default is 1.5 times the IQR above the third quartile (for the "max" line) and 1.5 times the IQR below the first quartile (for the "min" line). Some packages also use 3 times the IQR to flag only the most extreme outliers.
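To make the outlier cutoff concrete, here is a rough Python sketch. It assumes the common 1.5 times IQR convention and the "inclusive" quartile method from Python's `statistics` module. Your software may use a different quartile method or multiplier, so treat the numbers as one possible answer, not the only one.

```python
import statistics

data = [2, 3, 3, 3, 4, 5, 5, 6, 7, 9, 10]

# Quartiles (assumption: the "inclusive" method; packages differ here).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: cutoffs sit 1.5 * IQR beyond the quartiles.
lower_cutoff = q1 - 1.5 * iqr
upper_cutoff = q3 + 1.5 * iqr

# The boxplot "min" and "max" lines are the most extreme values
# that still fall inside the cutoffs; everything else is an outlier.
inside = [x for x in data if lower_cutoff <= x <= upper_cutoff]
outliers = [x for x in data if x < lower_cutoff or x > upper_cutoff]
whisker_min, whisker_max = min(inside), max(inside)

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("min line:", whisker_min, "max line:", whisker_max)
print("outliers:", outliers)
```

For this small dataset nothing falls outside the cutoffs, so the "min" and "max" lines land on the smallest and largest numbers.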
Here the boxplot is turned sideways, but it can be shown vertically as well:
Standard deviation
Standard deviation is a staple of statistics. It seems to appear almost everywhere. It also provides the foundation for other concepts, like Z-scores and "standard error" (which is the standard deviation of the sampling distribution).
You don't have to understand either of those things right now to understand standard deviation. The point is, LEARNING ABOUT STANDARD DEVIATION WILL PAY OFF OVER AND OVER AGAIN IN STATISTICS! So, take good notes!
How did you do? If you missed any, you may have failed to notice that the space between each line is 4 inches! That is the "magic" key in this case.
SO, WHAT IS STANDARD DEVIATION?
STANDARD:
When you think of "standard" think "RULER"--like a ruler you measure stuff with! A ruler gives a "standard" set length that you can use over and over again to compare things. If we are both 6 feet tall, we know we are the same height even if we didn't use the same ruler, because they are set to be the same.
DEVIATION:
Deviation means distance away from the mean. So, standard deviation means, "set distance from the mean".
While you may not see that definition in your class or textbook, it is a useful way for you to think of it.
WATCH OUT!!
The tricky part about standard deviation is what it is that is "set". You are used to set length in the case of a ruler, or a set number of seconds in a minute (60). BUT STANDARD DEVIATION DOESN'T WORK LIKE YOU MIGHT EXPECT!
Standard deviation has ties to calculus (area under a curve), but the good news is, this unit will get you through it without even a drop of background in calculus!
First, let's SEE what standard deviation looks like:
This shape is called a "bell curve" because it looks like a bell. This is covered in-depth in the unit on distributions. 1 standard deviation from the mean is labelled as "1Sd". 2 standard deviations from the mean is "2Sd" and so on. Negative standard deviations are labelled as negative (-). So "-2Sd" means, "2 standard deviations below the mean".
Here are the things you need to take note of:
1) The mean is right in the middle of the bell curve.
2) If you measure along the base of the bell curve, each standard deviation is the same distance from the next closest one--the spacing is "standard" or "set".
3) There is a certain area under the curve between any two standard deviation markers in this picture, and that area decreases as you move out to the edges.
Recall from the distribution lesson that a bell curve thins out the farther away from the mean you go in either direction. This is because most things are average. It sort of goes without saying, but the majority of observations fall closest to the mean in a bell curve. There are fewer and fewer observations the farther you go from the mean.
Think of it this way: The average male height in the world is around 5'9". How many men do you see that are between 5'5" and 6'1"? A LOT. Look again at the bell curve. The point labelled "-1Sd" happens to correspond to men that are 5'5" tall (don't worry how we know that yet...). The mean is 5'9". So, 34.1% of men are between 5'5" and 5'9". (Notice that is the area under the curve between -1Sd and the mean.)
Similarly, 34.1% of men are between 5'9" and 6'1", because 6'1" happens to correspond to 1Sd (again, don't worry about how we know that yet).
LET'S USE THESE FACTS ABOUT HEIGHT TO FILL OUT THE BELL CURVE:
- Average height for men is 5'9"
- -1Sd corresponds to 5'5"
- 1Sd corresponds to 6'1"
Remember, the lines have EQUAL SPACING from one to the next. In this case it is 4 inches. BUT THE SPACING IS DETERMINED BY THE AREA UNDER THE CURVE, NOT THE NUMBER 4!
Yes, this means that, between 1Sd (1 standard deviation) and the mean, there will always be 34.1% of the bell curve's area! THAT IS WHAT IS ACTUALLY STANDARD ABOUT STANDARD DEVIATION!
STOP and think about that again: 34.1%, 13.6% and 2.1% are standard percentages between the mean and 1Sd, 1Sd and 2Sd, and 2Sd and 3Sd! They never change! You can memorize this right now if you want because it is that useful.
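If you are curious where those percentages come from, here is a small Python sketch that looks them up for an idealized bell curve (a normal distribution with mean 0 and standard deviation 1). It uses `NormalDist` from Python's standard library. The helper name `area_between` is made up for this example.

```python
from statistics import NormalDist

# An idealized bell curve measured in units of standard deviations.
bell_curve = NormalDist(mu=0, sigma=1)


def area_between(low_sd, high_sd):
    """Area under the bell curve between two standard deviation markers."""
    return bell_curve.cdf(high_sd) - bell_curve.cdf(low_sd)


for low, high in [(0, 1), (1, 2), (2, 3)]:
    print(f"{low}Sd to {high}Sd: {area_between(low, high):.1%}")
```

The same areas apply on the negative side, because the bell curve is symmetric around the mean.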
You may be thinking, "Why such arbitrary areas?" Let's take a look at it using an example. Remember, no matter what bell-shaped dataset we use, we should see that the standard deviations correspond to these same areas under the curve. But, for each dataset, we need to figure out what the distance will be from one line to the next! In the case of male height, that distance happens to be 4 inches. But it will be different for each dataset.
Let's try it with a made up dataset: GPAs of students at Spartan High School. For simplicity, we will have only 5 students. Here is the dataset:
| Spartan High |
|---|
| 2.2 |
| 2.7 |
| 2.9 |
| 3.4 |
| 3.6 |
NOW, HERE IS A STEP-BY-STEP EXAMPLE:
STEP 1: Compute the mean (add up the observations, divide by the number of observations)

| | Spartan High |
|---|---|
| | 2.2 |
| | 2.7 |
| | 2.9 |
| | 3.4 |
| | 3.6 |
| MEAN: | 2.96 |
STEP 2: Subtract the mean from each observation:

| Spartan High | Observation minus mean |
|---|---|
| 2.2 | 2.2-2.96 = -0.76 |
| 2.7 | 2.7-2.96 = -0.26 |
| 2.9 | 2.9-2.96 = -0.06 |
| 3.4 | 3.4-2.96 = 0.44 |
| 3.6 | 3.6-2.96 = 0.64 |
STEP 3: Square all the results from STEP 2:

| Spartan High | A. | Column A. squared: |
|---|---|---|
| 2.2 | 2.2-2.96 = -0.76 | 0.5776 |
| 2.7 | 2.7-2.96 = -0.26 | 0.0676 |
| 2.9 | 2.9-2.96 = -0.06 | 0.0036 |
| 3.4 | 3.4-2.96 = 0.44 | 0.1936 |
| 3.6 | 3.6-2.96 = 0.64 | 0.4096 |
STEP 4: Add up the squared results from STEP 3:

| Spartan High | A. | B. |
|---|---|---|
| 2.2 | 2.2-2.96 = -0.76 | 0.5776 |
| 2.7 | 2.7-2.96 = -0.26 | 0.0676 |
| 2.9 | 2.9-2.96 = -0.06 | 0.0036 |
| 3.4 | 3.4-2.96 = 0.44 | 0.1936 |
| 3.6 | 3.6-2.96 = 0.64 | 0.4096 |
| | Column B. added up: | 1.252 |
STEP 5: Divide the result from STEP 4 by the total number of observations:

| observation | Spartan High | A. | B. |
|---|---|---|---|
| 1 | 2.2 | 2.2-2.96 = -0.76 | 0.5776 |
| 2 | 2.7 | 2.7-2.96 = -0.26 | 0.0676 |
| 3 | 2.9 | 2.9-2.96 = -0.06 | 0.0036 |
| 4 | 3.4 | 3.4-2.96 = 0.44 | 0.1936 |
| 5 | 3.6 | 3.6-2.96 = 0.64 | 0.4096 |
| | | 1.252/5 = | 0.2504 |
STEP 6: Take the square root:

| Spartan High | A. | B. |
|---|---|---|
| 2.2 | 2.2-2.96 = -0.76 | 0.5776 |
| 2.7 | 2.7-2.96 = -0.26 | 0.0676 |
| 2.9 | 2.9-2.96 = -0.06 | 0.0036 |
| 3.4 | 3.4-2.96 = 0.44 | 0.1936 |
| 3.6 | 3.6-2.96 = 0.64 | 0.4096 |
| | 1.252/5 = 0.2504 | √0.2504 = 0.5004 |
So, the last result (0.5004) is the standard deviation for this dataset.
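The six steps can also be sketched in a few lines of Python. This is only one way to write it, using the Spartan High numbers and dividing by the number of observations as in STEP 5. The variable names are illustrative.

```python
import math

gpas = [2.2, 2.7, 2.9, 3.4, 3.6]

mean = sum(gpas) / len(gpas)                 # STEP 1: compute the mean
deviations = [x - mean for x in gpas]        # STEP 2: subtract the mean
squared = [d ** 2 for d in deviations]       # STEP 3: square each result
total = sum(squared)                         # STEP 4: add up the squares
averaged = total / len(gpas)                 # STEP 5: divide by n
std_dev = math.sqrt(averaged)                # STEP 6: take the square root

# Rounded for display, because computers store decimals approximately.
print("mean:", round(mean, 2))
print("sum of squares:", round(total, 3))
print("divided by n:", round(averaged, 4))
print("standard deviation:", round(std_dev, 4))
```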
Think about the steps we did to compute it: We found the mean, and then in STEP 2, we computed all the distances from the mean. We know standard deviation means a set measure of how far things are from the mean. So far everything makes sense. Finding the distance from the mean is a natural step.
But, why do we square the distances in STEP 3? Well, the answer is painfully (or painlessly) simple: if you add up the raw distances from STEP 2, the negatives and positives cancel each other out and you get ZERO! This makes it look like nothing is any distance away from the mean at all (which is not true!).
***NOTE: The mathematical average balances all the values exactly in the middle, so adding up the distances of a set of numbers from their average always gives you zero--because there is the same "amount" of value above the average as below average.***
To get rid of the problem of coming up with ZERO, we square all the numbers at this stage.
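You can see the cancelling-out problem for yourself with this short Python sketch, which uses the Spartan High numbers. The raw sum is compared against a tiny tolerance instead of exactly zero, because computers store decimals approximately.

```python
gpas = [2.2, 2.7, 2.9, 3.4, 3.6]
mean = sum(gpas) / len(gpas)

# Distances from the mean (STEP 2): some negative, some positive.
deviations = [x - mean for x in gpas]

raw_sum = sum(deviations)                      # negatives cancel positives
squared_sum = sum(d ** 2 for d in deviations)  # squares are never negative

print("raw distances add up to zero:", abs(raw_sum) < 1e-9)
print("squared distances add up to:", round(squared_sum, 3))
```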
Because standard deviation is, loosely speaking, the "average" deviation from the mean, we must divide by the number of observations in STEP 5 to average out the squared distances we added up in STEP 4.
Finally, in STEP 6, we take the square root to reverse the effect of squaring them.
Note that your stats class or textbook may have you divide by 5 or by 5-1 ("n-1") in STEP 5. Dividing by 5 is called "uncorrected standard deviation" and dividing by 5-1 (dividing by 4) is called "corrected standard deviation".
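***NOTE for more advanced students: When your data is only a sample from a larger population, dividing by the number of observations tends to underestimate the spread, because the distances are measured from the sample's own mean. To correct this bias it is common to divide by "n-1" (number of observations minus 1) instead of by the number of observations. In the example above, this gives us a standard deviation of 0.5595--a slightly higher estimate. Taking the square root still leaves a small bias, so in some cases "n-1.5" is used, which almost completely eliminates the bias for bell-shaped data.***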
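Here is a hedged Python sketch comparing the different divisors on the Spartan High data. The function `std_dev` and its `correction` parameter are made up for this illustration. The cross-check uses `pstdev` (divide by n) and `stdev` (divide by n-1) from Python's standard `statistics` module.

```python
import math
import statistics

gpas = [2.2, 2.7, 2.9, 3.4, 3.6]


def std_dev(values, correction=0.0):
    """Standard deviation dividing by (n - correction) in STEP 5."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - correction))


uncorrected = std_dev(gpas)            # divide by n
corrected = std_dev(gpas, 1)           # divide by n-1
n_minus_1_5 = std_dev(gpas, 1.5)       # divide by n-1.5

# Cross-check against the standard library versions.
assert math.isclose(uncorrected, statistics.pstdev(gpas))
assert math.isclose(corrected, statistics.stdev(gpas))

print(round(uncorrected, 4), round(corrected, 4), round(n_minus_1_5, 4))
```

Notice that the smaller the divisor, the larger the standard deviation comes out.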
So, if standard deviation of the dataset from Spartan High is 0.5, and the mean is 2.96, can you fill out the rest of the bell curve? (Remember, start with the mean of 2.96 in the middle and use the spacing of 0.5 to add up or subtract down to the right answer).
Hopefully, you did well, but this can be tough! A key point is that the spacing between each line and the next on the graph is 0.5 grade points, because that is 1 standard deviation! The middle line is the mean (2.96). So if we want to know the GPA that goes with 2Sd, we simply add 2 units of 0.5 to 2.96 (2.96 + 0.5 + 0.5 = 3.96). The same thing goes for the negatives except we use subtraction. So -2Sd would be (2.96 - 0.5 - 0.5 = 1.96).
Another way to say it is that 0.5 is the standard deviation "ruler" for this dataset, so to get to 2Sd, we need to add two 0.5 length rulers to the mean.
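The "ruler" idea is easy to sketch in Python: start at the mean and lay down one standard deviation at a time. The function name `sd_markers` is hypothetical, and the sketch uses the rounded standard deviation of 0.5 from the text.

```python
def sd_markers(mean, sd, how_far=3):
    """Map each marker (-3Sd ... 3Sd) to its raw value: mean + k * sd."""
    return {k: round(mean + k * sd, 2) for k in range(-how_far, how_far + 1)}


markers = sd_markers(mean=2.96, sd=0.5)

for k, value in markers.items():
    label = "mean" if k == 0 else f"{k}Sd"
    print(f"{label}: {value}")
```

The same function works for any dataset. Swap in a mean of 69 inches and a standard deviation of 4 inches to rebuild the height curve.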
REVIEW:
- The raw value of standard deviation changes from dataset to dataset, but is fixed for each one.
- In our first example of height, the standard deviation was 4 inches. 4 inches was "set" as the standard distance from the mean for that dataset. However, in the Spartan High dataset, there was a different standard deviation--0.5. While it was different from the height dataset, it was also fixed for the Spartan High dataset.
- Each standard deviation marker corresponds to a certain area under the curve. That area is different between the mean and 1Sd, between 1Sd and 2Sd, and between 2Sd and 3Sd, but it is ALWAYS the same from one dataset to the next.
- In the height dataset AND the Spartan High dataset, the area between 1Sd and the mean was 34.1%. In fact that is always the area between 1Sd and the mean for every dataset that follows a bell curve! (Even though the raw number for Sd is different--4 for height and 0.5 for Spartan High).
- We can figure out what raw number for standard deviation goes with the set areas of 34.1%, 13.6% and 2.1% by finding the average distance from the mean. (STEPS 1 through 6 in the example).
- In other words, you don't have to do any calculus as long as you know how to compute the average distance from the mean like we did in the example. Doing so will give you the correct number for the standard areas of 34.1%, 13.6% and 2.1%.
- For example if you have a mean of 100, and compute Sd to be 25, then you know the area between 100 and 125 (100 + 25--the number 1Sd from the mean) is 34.1%. The area between 125 and 150 (between 1Sd and 2Sd) is 13.6% and so on.
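To see the review points in action, the Python sketch below treats each example from this unit as an ideal bell curve (that is an assumption, real data is never perfectly bell-shaped) and checks the area between the mean and 1Sd. The helper `share_between` is a made-up name. Heights are written in inches, so 5'9" becomes 69.

```python
from statistics import NormalDist

# (mean, standard deviation) pairs from the examples in this unit.
datasets = {
    "height (inches)": (69, 4),
    "Spartan High GPA": (2.96, 0.5),
    "mean 100, Sd 25": (100, 25),
}


def share_between(mean, sd, low, high):
    """Share of an ideal bell curve that lies between two raw values."""
    curve = NormalDist(mu=mean, sigma=sd)
    return curve.cdf(high) - curve.cdf(low)


shares = {}
for name, (mean, sd) in datasets.items():
    shares[name] = share_between(mean, sd, mean, mean + sd)
    print(f"{name}: {shares[name]:.1%} between the mean and 1Sd")

# The last bullet's second claim: the area between 125 and 150.
second_band = share_between(100, 25, 125, 150)
print(f"between 1Sd and 2Sd: {second_band:.1%}")
```

The raw "ruler" is different each time, but the area comes out the same.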
VARIANCE
Variance is another way to refer to distance from the mean. Variance is standard deviation squared. If you do not know standard deviation, you must go through STEPS 1 through 5 above (the result of STEP 5 is the variance, which was 0.2504 for Spartan High). If you know standard deviation, simply square it. As with standard deviation, be sure to know if your textbook or professor prefers dividing by n or n-1 (uncorrected or corrected variance). In rare cases, you may even be asked to correct by dividing by n-1.5!
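As a quick check, here is a Python sketch using the standard `statistics` module on the Spartan High data. In that module, `pvariance` divides by n and `variance` divides by n-1. The variable names are just for illustration.

```python
import math
import statistics

gpas = [2.2, 2.7, 2.9, 3.4, 3.6]

uncorrected_variance = statistics.pvariance(gpas)  # divide by n
corrected_variance = statistics.variance(gpas)     # divide by n-1

# Variance is standard deviation squared (same divisor on both sides).
assert math.isclose(statistics.pstdev(gpas) ** 2, uncorrected_variance)
assert math.isclose(statistics.stdev(gpas) ** 2, corrected_variance)

print("uncorrected variance:", round(uncorrected_variance, 4))
print("corrected variance:", round(corrected_variance, 4))
```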