Thursday, May 7, 2015

MEASURES OF VARIANCE

HOW FAR SPREAD OUT SOMETHING IS 

This unit explains different measures of variance. Measures of variance refer to how spread out a dataset is. Measures of variance include:

  • The range
  • Interquartile range
  • Standard deviation
  • Variance (this one seems obvious!)
After you complete this page and the quizzes on it, you should have a pretty solid foundation for understanding measures of variance. 

If you think you may already be a pro at this, just skip the explanations and go right to the quizzes. If you pass all the quizzes, you may be ready to move on! 

Range = The biggest number - the smallest ("Max-Min")
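
If you like to check your work with a computer, here is a minimal Python sketch of the "Max-Min" rule. The function name data_range and the example numbers are made up for illustration (they are not the quiz dataset below).

```python
def data_range(values):
    """Return the range: the biggest number minus the smallest."""
    return max(values) - min(values)

# Hypothetical example data, chosen only for illustration
scores = [12, 15, 15, 18, 20, 27]
print(data_range(scores))  # 27 - 12 = 15
```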

Pretty easy...Why don't you have a go at it:

What is the range of the following dataset?

2,3,3,3,4,5,5,6,7,9,10

    2 to 10
    9
    8
    7
    6



How did you do?







Interquartile Range=The 3rd quartile-the 1st quartile. 

So, what's a quartile? It is literally a quarter (like 25 cents). So the first thing to do is to identify what 1 quarter of the data is. 

Look at this little dataset: 1,1,2,3,3,3,3,4,5,6,8,9. 

There are 12 observations, so a quarter (or 1/4) of 12 is 3. 

This means that 1,1,2 are the first quarter (or 4th or first 25%) of the dataset.

3,3,3 are the next (2nd) quarter.

3,4,5 are the 3rd quarter.

And 6,8,9 are the fourth or last quarter. 

However, in statistics we often talk about quartiles instead of quarters. A quartile is that infinitesimally small point "between" one quarter and the next. So the first quartile is right between 2 and 3. In this case, we average the two numbers: (2 + 3) / 2 = 2.5. 

The 1st quartile is 2.5

The 2nd quartile is 3

The 3rd quartile is 5.5 (right between 5 and 6)

And...wait for it...There is no "4th quartile"! At least not that we talk about in statistics. Some people challenge this, and I suppose there is a theoretical 4th quartile right after the last number, but in this case, we don't know what the number after that is, so we can't average it anyway...

Despite a few people that want to talk about a "4th" quartile you will never really see it pop up--so no worries!

Now, we have the 1st, 2nd and 3rd quartiles. 


SIDE NOTE: It turns out that the 2nd quartile (sometimes called the "middle" quartile) is also the median. (Remember the median is the number right in the middle? So is the 2nd quartile!)
Now that you know the quartiles, the interquartile range is very straightforward: Find the 3rd quartile and the 1st quartile, then subtract the 1st from the 3rd. 

Interquartile range = the 3rd quartile - the 1st quartile. 

NOTE OF CAUTION: In the example above we had 12 observations and 4 divided nice and evenly into it. But that is not always the case. Consider this mini dataset:

4,5,6,6,6,7,7,8,9,9,10

Here we have 11 observations. So 11/4=2.75. So it is a little harder to break it up into quartiles (4ths). To do this, first find the median:


4,5,6,6,6,7,7,8,9,9,10

Median=7. 

Now divide the dataset into two smaller datasets, including the median in each:

4,5,6,6,6,7
              7,7,8,9,9,10

Now, find the median of each half (remember to include the median in each):

4,5,6,6,6,7 (6+6)/2=6

So Q1=6.

7,7,8,9,9,10 (8+9)/2=8.5

So Q3=8.5

Now we have 
Q1=6
Q2=7 (the median)
Q3=8.5

Can you compute the interquartile range? Remember it is just Q3-Q1 :)

IQR: 8.5-6=2.5
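
The halves method used in both examples can be sketched in Python roughly as follows. The function names quartiles and iqr are hypothetical, and the sketch assumes the rule from this page: when the number of observations is odd, the median is included in both halves. Other software may use slightly different rules and report slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the halves method from this page."""
    data = sorted(values)
    n = len(data)
    # Size of each half; when n is odd the two halves share the median.
    half = (n + 1) // 2
    q1 = median(data[:half])       # median of the lower half
    q2 = median(data)              # the overall median
    q3 = median(data[n - half:])   # median of the upper half
    return q1, q2, q3

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

print(quartiles([1, 1, 2, 3, 3, 3, 3, 4, 5, 6, 8, 9]))  # (2.5, 3.0, 5.5)
print(iqr([4, 5, 6, 6, 6, 7, 7, 8, 9, 9, 10]))          # 2.5
```
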
The interquartile range is often shown visually through a graphic known as the boxplot or box and whisker plot. 

Box and whisker plot explained

Notably, the "min" and "max" lines exclude outliers, so they may not match up with the very smallest and very largest numbers. The exact rule is different with each software package you use, but a common convention puts the "max" line at the largest observation within 1.5 times the IQR above the 3rd quartile, and the "min" line at the smallest observation within 1.5 times the IQR below the 1st quartile. 
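
Because the outlier rule varies between packages, the sketch below makes one assumption explicit: the limits sit a chosen multiple of the IQR beyond the 1st and 3rd quartiles, with 1.5 as the default multiple (one widely used convention). The function name whisker_limits is hypothetical; change the multiplier to match whatever your own software does.

```python
def whisker_limits(q1, q3, multiplier=1.5):
    """Return (low, high): points outside these limits count as outliers.

    Assumes limits are measured from the quartiles; multiplier=1.5 is
    an assumed default, not a universal rule.
    """
    spread = q3 - q1  # the interquartile range
    return q1 - multiplier * spread, q3 + multiplier * spread

# Using Q1=6 and Q3=8.5 from the 11-observation example
low, high = whisker_limits(6, 8.5)
print(low, high)  # 2.25 12.25
```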

Here the boxplot is turned sideways, but it can be shown vertically as well:







Standard deviation
Standard deviation is a staple of statistics. It seems to appear almost everywhere. It also provides the foundation for other concepts, like Z-scores and "standard error" (which is the standard deviation of the sampling distribution). 

You don't have to understand either of those things right now to understand standard deviation. The point is, LEARNING ABOUT STANDARD DEVIATION WILL PAY OFF OVER AND OVER AGAIN IN STATISTICS! So, take good notes!

SO, WHAT IS STANDARD DEVIATION?

STANDARD:

When you think of "standard" think "RULER"--like a ruler you measure stuff with!




A ruler gives a "standard" set length that you can use over and over again to compare things. If we are both 6 feet tall, we know we are the same height even if we didn't use the same ruler, because they are set to be the same.


DEVIATION:

Deviation means distance away from the mean.

So, standard deviation means, "set distance from the mean".

While you may not see that definition in your class or textbook, it is a useful way for you to think of it. 

WATCH OUT!!

The tricky part about standard deviation is what it is that is "set". You are used to set length in the case of a ruler, or a set number of seconds in a minute (60). BUT STANDARD DEVIATION DOESN'T WORK LIKE YOU MIGHT EXPECT! 

Standard deviation has ties to calculus (area under a curve), but the good news is, this unit will get you through it without even a drop of background in calculus!

 First, let's SEE what standard deviation looks like:

z-score normal distribution 
This shape is called a "bell curve" because it looks like a bell. This is covered in-depth in the unit on distributions. 1 standard deviation from the mean is labelled as "1Sd". 2 standard deviations from the mean is "2Sd" and so on. Negative standard deviations are labelled as negative (-). So "-2Sd" means, "2 standard deviations below the mean". 

Here are the things you need to take note of:

1) The mean is right in the middle of the bell curve.
2) If you measure along the base of the bell curve, each standard deviation is the same distance from the next closest one--the spacing is "standard" or "set". 
3) There is a certain area under the curve between any two standard deviation markers in this picture, and that area decreases as you move out to the edges. 

Recall from the distribution lesson that a bell curve thins out the farther away from the mean you go in either direction. This is because most things are average. It sort of goes without saying, but the majority of observations fall closest to the mean in a bell curve. There are fewer and fewer observations the farther you go from the mean. 

Think of it this way: The average male height in the world is around 5'9". How many men do you see that are between 5'5" and 6'1"? A LOT. Look again at the bell curve. The point labelled "-1Sd" happens to correspond to men that are 5'5" tall (don't worry how we know that yet...). The mean is 5'9". So, 34.1% of men are between 5'5" and 5'9". (Notice that is the area under the curve between -1Sd and the mean.) 

Similarly, 34.1% of men are between 5'9" and 6'1", because 6'1" happens to correspond to 1Sd (again, don't worry about how we know that yet). 

LET'S USE THESE FACTS ABOUT HEIGHT TO FILL OUT THE BELL CURVE:

  • Average height for men is 5'9"
  • -1Sd corresponds to 5'5"
  • 1Sd corresponds to 6'1"
NOW YOU TAKE IT FROM THERE....

z-score normal distribution



  1. What height corresponds to the green dashed line "mean"?

    5'1"
    5'10"
    6'0"
    5'9"

  2. What height corresponds to the blue line "-1Sd" (NEGATIVE 1 Sd)?

    6'1"
    5'10"
    5'5"
    5'9"

  3. What height corresponds to the blue line "1Sd"?

    6'1"
    5'10"
    5'5"
    5'9"

  4. Using the information above, how many inches are there from one line to the next?

    1
    2
    3
    4

  5. What height corresponds to the yellow line "-2Sd" (NEGATIVE 2 Sd)?

    5'1"
    5'10"
    5'9"
    5'5"

  6. What height corresponds to the yellow line "2Sd"?

    6'1"
    5'9"
    5'5"
    6'5"

  7. What height corresponds to the brown line "-3Sd" (NEGATIVE 3 Sd)?

    4'9"
    5'9"
    5'5"
    6'5"

  8. What height corresponds to the brown line "3Sd"?

    4'1"
    5'5"
    5'3"
    6'9"



How did you do? If you missed any, you may have failed to notice that the space between each line is 4 inches! That is the "magic" key in this case. 

Remember, the lines have EQUAL SPACING from one to the next. In this case it is 4 inches. BUT THE SPACING IS DETERMINED BY THE AREA UNDER THE CURVE, NOT THE NUMBER 4! 
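
To see the equal spacing in action, here is a small Python sketch that walks out from the mean in 4-inch steps. It assumes the mean of 5'9" (69 inches) and the 4-inch spacing from the height example; the names to_feet_inches and markers are made up for illustration.

```python
def to_feet_inches(total_inches):
    """Format a whole number of inches as feet'inches"."""
    feet, inches = divmod(total_inches, 12)
    return f"{feet}'{inches}\""

MEAN_INCHES = 69  # 5'9", the mean from the height example
SD_INCHES = 4     # the spacing between one line and the next

# Height at each marker from -3Sd to 3Sd
markers = {k: to_feet_inches(MEAN_INCHES + k * SD_INCHES) for k in range(-3, 4)}
for k, height in markers.items():
    print(f"{k}Sd: {height}")
```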

Yes, this means that, between 1Sd (1 standard deviation) and the mean, there will always be 34.1% of the bell curve's area! THAT IS WHAT IS ACTUALLY STANDARD ABOUT STANDARD DEVIATION! 

STOP and think about that again: 34.1%, 13.6% and 2.1% are standard percentages between the mean and 1Sd, 1Sd and 2Sd, and 2Sd and 3Sd! They never change! You can memorize this right now if you want because it is that useful. 
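
If you are curious where those percentages come from, the sketch below looks them up with NormalDist from Python's standard statistics module (available in Python 3.8 and later), so no calculus by hand is needed. The helper name area_between is hypothetical, and the numbers apply to a bell curve (normal distribution).

```python
from statistics import NormalDist

# A standard bell curve: mean 0, standard deviation 1
standard_bell = NormalDist(mu=0, sigma=1)

def area_between(low_sd, high_sd):
    """Share of the bell curve's area between two standard deviation markers."""
    return standard_bell.cdf(high_sd) - standard_bell.cdf(low_sd)

for low, high in [(0, 1), (1, 2), (2, 3)]:
    print(f"{low}Sd to {high}Sd: {area_between(low, high):.1%}")
# 0Sd to 1Sd: 34.1%
# 1Sd to 2Sd: 13.6%
# 2Sd to 3Sd: 2.1%
```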

You may be thinking, "Why such arbitrary areas?" Let's take a look at it using an example. Remember, no matter what dataset we use, we should see that the standard deviations correspond to these same areas under the curve. But, for each dataset, we need to figure out what the distance will be from one line to the next! In the case of male height, that distance happens to be 4 inches. But it will be different for each dataset. 

Let's try it with a made up dataset: GPAs of students at Spartan High School. For simplicity, we will have only 5 students. Here is the dataset:

Spartan High
2.2
2.7
2.9
3.4
3.6

NOW, HERE IS A STEP-BY-STEP EXAMPLE:


STEP 1: Compute the mean (add up the observations, divide by the number of observations)

Spartan High
2.2
2.7
2.9
3.4
3.6

MEAN: 2.96

STEP 2: Subtract the mean from each observation:

Spartan High
2.2             2.2 - 2.96 =    -0.76
2.7             2.7 - 2.96 =    -0.26
2.9             2.9 - 2.96 =    -0.06
3.4             3.4 - 2.96 =    0.44
3.6             3.6 - 2.96 =    0.64

STEP 3: Square all the results from STEP 2:

Spartan High                    A.          Column A. squared:
2.2             2.2 - 2.96 =    -0.76       0.5776
2.7             2.7 - 2.96 =    -0.26       0.0676
2.9             2.9 - 2.96 =    -0.06       0.0036
3.4             3.4 - 2.96 =    0.44        0.1936
3.6             3.6 - 2.96 =    0.64        0.4096

STEP 4: Add up the squared results from STEP 3:

Spartan High                    A.          B.
2.2             2.2 - 2.96 =    -0.76       0.5776
2.7             2.7 - 2.96 =    -0.26       0.0676
2.9             2.9 - 2.96 =    -0.06       0.0036
3.4             3.4 - 2.96 =    0.44        0.1936
3.6             3.6 - 2.96 =    0.64        0.4096

                                            1.252   <--Column B. added up

STEP 5: Divide the result from STEP 4 by the total number of observations:

observation   Spartan High                    A.          B.
1             2.2             2.2 - 2.96 =    -0.76       0.5776
2             2.7             2.7 - 2.96 =    -0.26       0.0676
3             2.9             2.9 - 2.96 =    -0.06       0.0036
4             3.4             3.4 - 2.96 =    0.44        0.1936
5             3.6             3.6 - 2.96 =    0.64        0.4096

                                              1.252/5 =   0.2504

STEP 6: Take the square root:

Spartan High                    A.          B.
2.2             2.2 - 2.96 =    -0.76       0.5776
2.7             2.7 - 2.96 =    -0.26       0.0676
2.9             2.9 - 2.96 =    -0.06       0.0036
3.4             3.4 - 2.96 =    0.44        0.1936
3.6             3.6 - 2.96 =    0.64        0.4096

                1.252/5 =       √0.2504 =   0.5004

So, the last result (0.5004) is the standard deviation for this dataset. 
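
The six steps above can be sketched in Python like this. The variable names are made up for illustration, and the sketch divides by the number of observations in STEP 5, just as the worked example does.

```python
from math import sqrt

gpas = [2.2, 2.7, 2.9, 3.4, 3.6]  # the Spartan High dataset

mean = sum(gpas) / len(gpas)               # STEP 1: compute the mean
deviations = [x - mean for x in gpas]      # STEP 2: subtract the mean
squared = [d ** 2 for d in deviations]     # STEP 3: square each result
total = sum(squared)                       # STEP 4: add up the squares
mean_square = total / len(gpas)            # STEP 5: divide by the number of observations
std_dev = sqrt(mean_square)                # STEP 6: take the square root

print(round(mean, 2), round(total, 3), round(std_dev, 4))  # 2.96 1.252 0.5004
```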

Think about the steps we did to compute it: We found the mean, and then in STEP 2, we computed all the distances from the mean. We know standard deviation means a set measure of how far things are from the mean. So far everything makes sense. Finding the distance from the mean is a natural step.

But, why do we square the distances in STEP 3? Well, the answer is painfully (or painlessly) simple: if you add them up, they cancel each other out and you get ZERO! This makes it look like nothing is any distance away from the mean at all (which is not true!). 

***NOTE: The mathematical average balances all the values exactly in the middle, so adding up the distances of a set of numbers from their average always gives you zero--because there is the same "amount" of value above the average as below average.***

To get rid of the problem of coming up with ZERO, we square all the numbers at this stage. 
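
Here is a quick Python check of that cancelling-out effect, using the Spartan High numbers. Because computers store decimals with tiny rounding errors, the sketch tests whether the sum is extremely close to zero rather than exactly zero (the 1e-9 tolerance is an arbitrary choice for illustration).

```python
gpas = [2.2, 2.7, 2.9, 3.4, 3.6]
mean = sum(gpas) / len(gpas)

deviations = [x - mean for x in gpas]
squared = [d ** 2 for d in deviations]

# The raw distances cancel out (up to tiny rounding error)...
print(abs(sum(deviations)) < 1e-9)  # True
# ...but the squared distances do not.
print(sum(squared) > 0)             # True
```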

Because standard deviation is, mathematically speaking, the "average" deviation from the mean, we must divide by the number of observations in STEP 5 to average it out. 

Finally, in STEP 6, we take the square root to reverse the effect of squaring them. 

Note that your stats class or textbook may have you divide by 5 or by 5-1 ("n-1") in STEP 5. Dividing by 5 is called "uncorrected standard deviation" and dividing by 5-1 (dividing by 4) is called "corrected standard deviation". 

***NOTE for more advanced students: When your dataset is only a sample from a larger population, dividing by 5 tends to underestimate the spread, because the mean was estimated from the same observations. To correct this bias it is common to divide by "n-1" (number of observations minus 1) instead of by the number of observations. In the example above, this gives us a standard deviation of 0.5595--a slightly higher estimate. Taking the square root leaves a small bias behind, so in some cases "n-1.5" is used, which removes almost all of it for bell-shaped data.***
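
Python's standard statistics module offers both versions, so you can compare them side by side. In this sketch, pstdev divides by n (uncorrected) and stdev divides by n-1 (corrected); the variable names are made up for illustration.

```python
from statistics import pstdev, stdev

gpas = [2.2, 2.7, 2.9, 3.4, 3.6]

uncorrected = pstdev(gpas)  # divides by n (5)
corrected = stdev(gpas)     # divides by n-1 (4)

print(round(uncorrected, 4), round(corrected, 4))  # 0.5004 0.5595
```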

So, if standard deviation of the dataset from Spartan High is 0.5, and the mean is 2.96, can you fill out the rest of the bell curve? (Remember, start with the mean of 2.96 in the middle and use the spacing of 0.5 to add up or subtract down to the right answer).


z-score normal distribution



  1. What GPA corresponds to 3Sd?

    2.73
    4.46
    3.96
    3.56

  2. What GPA corresponds to -2SD (NEGATIVE 2Sd)?

    1.96
    0.96
    1.46
    2.56

  3. What GPA corresponds to 1Sd?

    3.96
    3.56
    3.063
    3.46

  4. What GPA corresponds to -1Sd (NEGATIVE 1Sd)?

    2.56
    1.56
    1.96
    2.46

  5. What GPA corresponds to 2Sd?

    3.76
    3.36
    3.96
    3.46


Hopefully, you did well, but this can be tough! A key point is that the spacing between each line and the next on the graph is 0.5 grade points, because that is 1 standard deviation! The middle line is the mean (2.96). So if we want to know the GPA that goes with 2Sd, we simply add 2 units of 0.5 to 2.96 (2.96 + 0.5 + 0.5 =3.96). The same thing goes for the negatives except we use subtraction. So -2Sd would be (2.96 - 0.5 - 0.5 =1.96). 

Another way to say it is that 0.5 is the standard deviation "ruler" for this dataset, so to get to 2Sd, we need to add two 0.5 length rulers to the mean. 
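
The "ruler" idea can be sketched in Python as adding or subtracting whole rulers from the mean. This assumes the mean of 2.96 and the rounded standard deviation of 0.5 used in the explanation above; the function name gpa_at is hypothetical.

```python
MEAN_GPA = 2.96
SD_GPA = 0.5  # the standard deviation "ruler", rounded from 0.5004

def gpa_at(k):
    """GPA that sits k standard deviations from the mean (k can be negative)."""
    return round(MEAN_GPA + k * SD_GPA, 2)

for k in range(-3, 4):
    print(f"{k}Sd: {gpa_at(k)}")
```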

REVIEW: 

  • The raw value of standard deviation changes from dataset to dataset, but is fixed for each one. 
    • In our first example of height, the standard deviation was 4 inches. 4 inches was "set" as the standard distance from the mean for that dataset. However, in the Spartan High dataset, there was a different standard deviation--0.5. While it was different from the height dataset, it was also fixed for the Spartan High dataset. 
  • Each standard deviation marker corresponds to a certain area under the curve. That area differs between the 1st, 2nd and 3rd standard deviations, but is ALWAYS the same from one bell-shaped dataset to the next. 
    • In the height dataset AND the Spartan High dataset, the area between 1Sd and the mean was 34.1%. In fact, that is always the area between 1Sd and the mean for every dataset that follows a bell curve! (Even though the raw number for Sd is different--4 for height and 0.5 for Spartan High).
  • We can figure out what raw number for standard deviation goes with the set areas of 34.1%, 13.6% and 2.1% by finding the average distance from the mean. (STEPS 1 through 6 in the example). 
    • In other words, you don't have to do any calculus as long as you know how to compute the average distance from the mean like we did in the example. Doing so will give you the correct number for the standard areas of 34.1%, 13.6% and 2.1%. 
    • For example if you have a mean of 100, and compute Sd to be 25, then you know the area between 100 and 125 (100 + 25--the number 1Sd from the mean) is 34.1%. The area between 125 and 150 (between 1Sd and 2Sd) is 13.6% and so on. 
The observant student may realize that we can use standard deviations to refer to things like percentiles. For example, if your GPA corresponds to 3Sd (3 standard deviations above the mean), your GPA is in the 99.9th percentile. Why? Because, as the bell curve chart shows, only 0.1% of the total area lies beyond the 3rd standard deviation--so your GPA outscored 99.9% (100%-0.1%) of students! More on this in the unit on Z scores.
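
As a rough check on that percentile, the sketch below asks Python's NormalDist (from the standard statistics module) how much of a bell curve lies below a given number of standard deviations. The function name percentile_for is hypothetical, and the result assumes the data really do follow a bell curve.

```python
from statistics import NormalDist

def percentile_for(k_sd):
    """Percent of a bell curve that falls below k_sd standard deviations."""
    return NormalDist().cdf(k_sd) * 100

print(round(percentile_for(3), 1))  # 99.9
print(round(percentile_for(0), 1))  # 50.0 (the mean sits at the 50th percentile)
```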

VARIANCE

Variance is another way to refer to distance from the mean. Variance is standard deviation squared. 

If you do not know standard deviation, you must go through STEPS 1 through 5 above. If you know standard deviation, simply square it. As with standard deviation, be sure to know if your textbook or professor prefers dividing by n or n-1 (uncorrected or corrected variance). In rare cases, you may even be asked to correct by dividing by n-1.5! 
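
Here is a minimal Python sketch of both routes to the variance, using the Spartan High dataset. From the standard statistics module, pvariance divides by n and variance divides by n-1; the variable names are made up for illustration.

```python
from statistics import pstdev, pvariance, variance

gpas = [2.2, 2.7, 2.9, 3.4, 3.6]

uncorrected_var = pvariance(gpas)  # divides by n (STEPS 1 through 5)
corrected_var = variance(gpas)     # divides by n-1

print(round(uncorrected_var, 4))   # 0.2504
print(round(corrected_var, 4))     # 0.313

# Squaring the standard deviation gives the same variance
print(round(pstdev(gpas) ** 2, 4))  # 0.2504
```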

HOPEFULLY THIS UNIT HELPED! KEEP PERSISTING AND YOU WILL BE A STATS MASTER SOONER THAN YOU KNOW!

PLEASE SUBSCRIBE TO THE BLOG AND LEAVE COMMENTS! CHECK OUT OUR OTHER STATS TOPICS BY CLICKING HERE.


