Friday, October 14, 2016

Hypothesis testing--"Statistics on trial"

Preamble

You may remember the original Karate Kid movie where a wise and aged Mr. Miyagi takes a young and unseasoned Daniel-san under his wing to teach him karate!

Mr. Miyagi puts Daniel-san to work sanding the floor, painting the fence, painting his house and waxing his cars. When Daniel-san starts to complain that he has learned nothing about karate, that he is only Mr. Miyagi's slave, Miyagi unleashes a barrage of karate attacks against which Daniel-san is able to defend himself using the motions from the chores he has been doing around Miyagi's house.

{Image: the "sand the floor" scene from The Karate Kid}


It is done in a way that can only exist in Hollywood. Still, it is hoped that in this learning module (if you have been diligent in the previous modules) you will have an experience something like the "sand the floor, paint the fence" scene from Karate Kid.

It might even be a good idea to take a 3 minute break and watch the clip from your favorite online video source! Come back when you feel inspired, or continue if you already feel inspired.

What is hypothesis testing?

Over the next few weeks, we will learn about hypothesis testing. This is intended to be a very short introduction to just the basics. More detail will come later. Just focus on these key terms and ideas and then let it set up over the Fall Break!

To begin, Daniel-san learned to sand the floor, paint the fence, wax on wax off, etc. What are the equivalent tools that you have in your arsenal?

  • General knowledge about Z Scores
  • Finding Z Scores for a sample mean in a theoretical distribution (just like Z Scores for observations in a sample except we divide by STANDARD ERROR instead of standard deviation!)
  • Finding confidence intervals (finding two Z Scores that contain a certain percent of the distribution)
You are ready to go!

The main idea of hypothesis testing is that you can use these tools with your data to make a point, but we have to play a little game rooted in philosophy. (Yes, many statistical concepts originate in philosophy). 

Here's an example: 

A local organization claims that children are spending 3 hours per day on screentime. This seems a little high to you, so you want to test it. 

So, here is the game we play (first in English, then in Stats Language):

ENGLISH:

Ok...you think children spend 3 hours per day on screentime? I challenge this assertion! I challenge the status quo. 

So, you say screentime is 3 hours per day. 

I don't think it is. 

I am willing to risk a 5% chance that I am wrong but I want a 95% chance that I am right. 

I took a sample of 300 children and the average is only 2.3 hours. The standard deviation (not standard error!) of the sample is 1.8 hours. Let's compute a Z Score for my sample average to see how likely it is to get a sample mean of 2.3 hours. If fewer than 5% of sample means are this far away from the proposed 3 hour number, I am more than 95% confident that I am right and I am willing to reject the 3 hour number!

STOP AND THINK ABOUT ALL OF THAT UNTIL YOU FEEL CONFIDENT ABOUT IT IN ENGLISH. THEN MOVE ON TO THE STATS LANGUAGE INTERPRETATION:

STATS LANGUAGE:

H0: μ = 3 hours
H1: μ ≠ 3 hours

α =.05

Xbar = 2.3 hours
s = 1.8 hours

That's it! And this is hypothesis testing!

Hopefully you just had a "Miyagi moment", but if not, stop and think about it for a moment. 

We are simply using what we know about distributions and Z Scores to put some assertion on trial!

H0 means "this is the thing on trial". It is usually the status quo because it is something we want to debunk! NOTE! NOTE! NOTE! It contains a statement of equality! 

H0 is called the "Null hypothesis" and null means "nothing". It is the statement of "no difference" and no difference means equal! 

It is the statement of what is! (Just remember you are putting this statement on trial and you have to know what something is in order to put it on trial). 



Just remember this cartoon. You have to know what something is in order to put it on trial! H0 is on trial and must contain an "equal" statement. Usually, it says the population mean is (equals) some number, OR is less than or equal to (or greater than or equal to) some number. 

H1 is the evidence against H0! The prosecution, if you will! You have to find enough evidence against H0 in order to convict (reject) it. 

α is our tolerance for making a mistake. You know how in court, one is innocent until proven guilty? This is to avoid calling an innocent person guilty! Mistakes still happen, but we can focus on avoiding the mistake we least want to make. 

There are two kinds of mistakes:

  • Calling a guilty person innocent 
  • Calling an innocent person guilty

In court, we want to avoid calling an innocent person guilty, so we avoid it in the way trials are set up: You only give a guilty verdict if you are sure beyond a reasonable doubt.

In statistics we have two kinds of mistakes:
  1. Rejecting H0 when it really is correct
  2. Failing to reject H0 when it is not correct
In statistics, we want to avoid the first one. It is called Type I error. It goes with our alpha level (α)! 
So, when we said α=.05, it actually means, we will only reject H0 if we are satisfied that there is only a 5% chance or less that we are rejecting the true statement. We want to be 95% confident that we have not made a mistake! 

So, in stats, we want to avoid the Type I error. 

This means that sometimes we fail to reject H0 when we should have. That is why our wording must always be like this (repeat after me):

We reject H0
OR 
We fail to reject H0.

There is no other conclusion in hypothesis testing. 
There is no other conclusion in hypothesis testing. 
And, finally, there is no other conclusion in hypothesis testing. 

In court, there is only "guilty" or "not guilty". 
In hypothesis testing, there is only "reject H0" or "fail to reject H0".

There is no "innocent" verdict in court and there is no "accept H0" verdict in hypothesis testing. 

Example

Because examples are an effective way to learn statistics, we will end this shorter week with an example. Next week, we will resume and get into greater detail. 

This example shows how to do a hypothesis test. We will use the numbers from the screentime example. 


H0: μ = 3 hours
H1: μ ≠ 3 hours

α =.05

Xbar = 2.3 hours
s = 1.8 hours

Standard Error = 1.8/√300
1.8/17.32 = .10

Test: (Xbar-μ)/S.E.

(2.3-3.0)/.10 = -.7/.10 = -7.0

What is -7.0? Think of it as a Z Score (it is technically a "t" score, but more on that next week). 

If you use the Math Is Fun Z Score tool, you will see that -7.0 is literally off the charts! It is out there somewhere, the tail of the distribution has just thinned out so much, there is no need to include it in most charts. In fact, only 0.13% of sample means are expected to fall below a Z Score of -3.0. So, even with a Z of -3.0 we would have been 99.87% confident that we would not see a sample mean that low with a true mean of 3.0 hours. 

With a Z of -7.0, we are greater than 99.999999% confident that we would not see a sample mean as low as 2.3 hours if the true mean is really 3.0 hours. Look how thin the tail is at that point. It almost looks like 0 to the naked eye, and it basically is 0, it is so small (smaller than 0.000001).
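If you would like to check this on a computer, here is a minimal sketch in plain Python (standard library only; the variable names are just for illustration, and the numbers mirror the screentime example above):

from math import sqrt
from statistics import NormalDist

mu0 = 3.0      # the value on trial (H0)
xbar = 2.3     # sample mean
s = 1.8        # sample standard deviation
n = 300        # sample size

se = s / sqrt(n)                    # standard error, about 0.10
z = (xbar - mu0) / se               # test statistic, about -6.7 (the lesson rounds SE to 0.10, giving -7.0)
p = 2 * NormalDist().cdf(-abs(z))   # two-sided tail area, treating z as a Z Score
print(round(z, 2), p)               # p comes out far, far below alpha = .05

Treating z as a Z Score here is a simplification (as noted below, it is technically a "t" score), but with 300 observations the difference is tiny.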



So, what do we conclude? 

We reject the null hypothesis (H0) that μ = 3 hours, in support of the alternative (H1) that the real mean is not equal to 3 hours!

We call our risk of making a Type I error, alpha.

We call the probability of getting a sample result at least this extreme, if H0 really is true, our p value (p for probability).

So, with a Z Score of -7.0 we have a p value less than 0.000001%. It is less than the alpha (tolerance) level of 5%.

How do we remember this? With a small poem:

"When p is low, reject the null!
When p is low, reject H0!"
 

Now you try a quiz. After that, go enjoy your Fall Break!



{quiz}


Friday, October 7, 2016

ESTIMATION

INTRODUCTION

This topic usually gets a little tricky for students, so don't get worried if it doesn't all "click" right away. An effective way for many students is to think of this unit in parts:

  1. I know how to compute Z Scores in general (if not, review the unit on computing Z Scores for an observation in a sample).
  2. I used Z Scores to find the area under the standard normal curve that was associated with a certain score. 
  3. Now, I can use Z Scores in a similar way to find the area under the curve that is associated with a certain sample mean.
  4. A major key is: If I don't know the "true" population mean, I can take a sample, compute the mean and use it to estimate the "true" population mean!
Let's do #4 again! It is just that important:

If I don't know the "true" population mean, I can take a sample, compute the mean and use it to estimate the "true" population mean!
THIS IS ONE OF THE GREATEST THINGS IN STATISTICS!

It truly is! Think of it-- If you want to know the average height of all 320 million Americans, you can take a sample of 300 people and estimate that average from the sample of 300!

You might be thinking:

"Dr. Oliver, this is too good to be true! There must be a catch!"

There is...But not much of one. The catch is that our sample needs to be large enough. How large is "large enough", you ask? It depends on several things, but we will focus on SAMPLE SIZE! 

For the interested and highly-aspiring student (like you are!), homogeneity of the population is also a big deal! If your population of 320 million is EXACTLY the same, you only need a sample of 1 person to know everything about the population! If they are all completely and utterly different, you need a sample of 320 million to really know about the population.*

As you can see, the level of homogeneity ("sameness") and sample size go hand in hand! 

We will not go into depth in this class on how to use homogeneity to determine a proper sample size (though it is a fascinating topic!). The point that you should appreciate is that we have 320 million unique individuals in the United States, so unless we sample them all, we will probably not get exactly the right information! 

TRUE!

This is where estimation comes in! When we take a sample of 300 and find the mean of that sample, it will almost never be exactly equal to the population mean. 

Think of it. Suppose the average number of hours spent by Americans playing video games per week is 2.5. In fact, we are rounding and the number is probably really something like: 2.5946713496838495698182933409485920948589209194890301038902... you get the idea...


  • What are the chances that my sample of 300 people will also have a mean of 2.5946713496838495698182933409485920948589209194890301038902...? Zero. Not gonna happen...
  • What are the chances that my sample  of 300 people will have a mean of 2.5? Pretty high.
  • What are the chances that my sample  of 300 people will have a mean of 2.4321 or 2.69784? Also pretty high. In fact, more likely than not! 
MOST SAMPLES WILL BE SOMEWHERE NEAR THE "TRUE" POPULATION MEAN!

NO SAMPLE WILL BE EXACTLY THE POPULATION MEAN. 

This is why some statisticians have been heard saying something that is probably incomprehensible to the rest of the world: 

"No one is average." 
Let's take a quick quiz to see if you know why statisticians say this. 

{quiz: Based on what we have just learned why do statisticians say: "No one is average?" A. Because, while most people will probably fall near the mean, no one will be exactly the mean. B. Because of the law of large numbers. C. Because of the individualistic culture in America. D. Statisticians are wrong. Most people are average. }

{Image: "Nobody is Average" -- a normal distribution curve with human figures inside it and DNA as the curve. From a CDC.gov blog post on how no one is average: https://blogs.cdc.gov/genomics/2014/07/02/nobody-is-average/}

We will leave the argument right there about whether or not anyone actually is average, because the only point you need to take away is that, if we took a random sample of 300 people and did so 1,000 times, none would be the same, and none would match the population average. 

For the student that likes hands-on learning, go back to my simulation and try it! Look at the average for the sample of 300 and draw new samples over and over until you are convinced that none will be the same...

{File Insert: Sample Simulator}

This unit is divided into three parts:

I. Review of Z Score calculations
II. Review of differences if you want to estimate a population parameter
III. "How good is the sample?" An introduction to confidence! 

REVIEW OF Z SCORE CALCULATIONS

Let's start off right away with a quiz!

{Quiz: Z to raw, raw to Z revisited! 

Suppose Darrell is a pretty fast runner, but only feels better about himself if he is faster than other people [for shame, Darrell :( Try running for a PR instead!]. Darrell can run a mile in 5 minutes and 59 seconds, which is exactly 10.0278552 miles per hour (remember you can round to 2 decimal places!). If the American average mile time is 5.217391932455 miles per hour, and the standard deviation is 2.7896415893 miles per hour, then Darrell is faster than what percent of Americans?

Remember, you can use the Math Is Fun Z Score to area tool: https://www.mathsisfun.com/data/standard-normal-distribution-table.html

A. 95.73%
B. 45.73%
C. 1.72%
D. 4.81%

SOLUTION: 

Remember, Darrell =10.03, Mean=5.22, sd=2.79. 

1) Darrell is 4.81 mph faster than the mean (10.03-5.22=4.81). Z Score IS the number of standard deviations from the mean, so turn 4.81 into standard deviations=4.81/2.79=1.72. 

2) 1.72 is the number of standard deviations that Darrell's speed is from the mean, and Z Score IS number of standard deviations, so Darrell's Z Score is 1.72. 

3) Now, find the area BELOW 1.72 ("Up to Z") on the standard normal table.}
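If you would rather verify that answer with a computer than with the table, here is a minimal sketch in Python (standard library only, using the rounded numbers from the solution above):

from statistics import NormalDist

darrell = 10.03   # Darrell's speed in mph (rounded)
mean = 5.22       # American average in mph (rounded)
sd = 2.79         # standard deviation in mph (rounded)

z = round((darrell - mean) / sd, 2)    # 1.72 standard deviations above the mean
area_below = NormalDist().cdf(z)       # proportion of Americans slower than Darrell
print(z, round(area_below * 100, 2))   # 1.72 and about 95.73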

Now you have a review of the basics. Try another quiz with three more problems.

{Quiz: Z to raw, raw to Z: Practice makes perfect.

Open this link in another window or browser: http://www.espn.com/nfl/statistics/team/_/stat/total
The link will connect you to a table of data for all 32 NFL teams! 

Use the "YDS" column (total yards for the NFL team) to answer the following questions. The mean for total yards "YDS" is 1410.969 and the standard deviation is 192.8788 (you may round to two decimal places):

1. What do the individual numbers under the "YDS" column represent if our population is "all NFL teams"?

A. Observations in a population
B. Observations in a sample
C. Sample means in a theoretical distribution
D. Means in a sample.

F/B: These are observations in a population. Our population is all NFL teams, and this is a table of "all 32 NFL teams," as mentioned below the hyperlink!

2. What percent of teams have more yards than San Francisco? (USE Z Scores! Do not count by hand!)

A. 40.52%
B. 23.9%
C. 9.1%
D. 59.1%

F/B: Here we have the same thing as the last quiz, just different numbers! 

San Francisco is 46.03 yards above average (1457-1410.97=46.03). Divide by s.d.=46.03/192.88=.238. That is the Z Score. Find AREA BEYOND Z ("Z ONWARDS") in the Math Is Fun table and that is the answer!

3. What percent of teams have yard totals that fall between those of Pittsburgh and Denver? 
A. 26.07%
B. 45.0%
C. 87.88%
D. 7.8%}

The last question takes a little creativity in combining concepts you have already learned! Let's call Pittsburgh's yardage (1498) the UPPER score and Denver's (1369) the LOWER score. We want the area between. RULE #1: DON'T PANIC! YOU CAN DO THIS!

Step 1: Compute the UPPER Z Score
Step 2: Compute the LOWER Z Score

Those should be pretty straightforward. 

Step 1: 1498-1410.97=87.03. 87.03/192.88=0.45. Z=0.45.
Step 2: 1369-1410.97=-41.97. -41.97/192.88=-0.22. Z=-0.22.

Now, this is where many get stuck. There is no "Area between two Z Scores" on the Z Score finder! Here is where that creativity comes in. It can best be facilitated by drawing pictures! Students that use this picture drawing technique improve their ability to do statistics over the course of the semester by 3 times as much as those who do not! 

Here is the picture drawing strategy:

{Image: a drawing of the normal curve with the mean in the middle}

Now go back to your Z Scores and sketch them onto the picture too! 


Hopefully you can see that we now have something we can work with! Specifically, the area between the mean and -.22 + the area between the mean and .45 will give us our shaded area!


 So, use the Z Score finder from Math Is Fun to find each area and you get: 17.36 for area between mean and Pittsburgh, and 8.71 for the area between the mean and Denver. (Notice, there is no negative area! Area is area--just in different directions!)

17.36+8.71=26.07 and that is the answer! 26.07% of teams have total yardage between that of Denver and that of Pittsburgh.
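If you would like to double-check the "area between two Z Scores" idea with a computer, here is a minimal sketch in Python (standard library only; the numbers come straight from the walkthrough above):

from statistics import NormalDist

mean, sd = 1410.97, 192.88       # total yards: mean and standard deviation
pittsburgh, denver = 1498, 1369

z_upper = round((pittsburgh - mean) / sd, 2)   # 0.45
z_lower = round((denver - mean) / sd, 2)       # -0.22
area = NormalDist().cdf(z_upper) - NormalDist().cdf(z_lower)
print(round(area * 100, 2))      # about 26.07 percent of teams fall between them

Subtracting the two "up to Z" areas is another route to the same 26.07% as adding the two pieces around the mean.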

Notice there are two ways to do the same thing:

{Image: two ways of finding the same area between two Z Scores}

Either works, but there are benefits to learning to use both. If you learn both, you will have more tools for quickly finding areas of interest. Students that learn both also tend to better understand what it means to find a certain area beneath the curve. 

Differences for population estimate versus sample statistics

In the last unit, you learned that we can use this same idea of computing Z Scores when we want to know how good of an estimate our sample mean is of the "true" population mean. 

You may also remember that we had to make some adjustments. 

When we don't know the population parameters, we must adjust!

So, we know that we find the standard deviation of an entire population like this: 

σ = √(Σ(X-μ)²/N)

Do you recognize that? If not, pull out your Statistics dictionary! 


In short (or long): "Subtract the population mean from each observation of interest, square it, add it all up and divide by the number in the population"! 

They are your old friends, the columns! 

  • Column 1: X (the observations of interest)
  • Column 2: μ (we were calling it "X bar" or the mean)
  • Column 3: X-μ (observations minus the mean)
  • Column 4: (X-μ)² (Square the differences)
  • Column 5: Σ(X-μ)²/N (Add up the squared differences and divide by the total number of observations)

This was variance!
Now take the square root: √(Σ(X-μ)²/N)

“Holy Z Score, Batman!” You have known "√(Σ(X-μ)²/N)" all along!
What a statistician you are becoming. I like to think that this is your proudest day in statistics! If not, perhaps it should be! It means that you understand your first awesome equation and can impress your friends and colleagues. 

Suppose you are in a research meeting and someone that is very statistically-minded sends you an email that says: "Hey, {fill in your name here}. Can you run a quick √(Σ(X-μ)²/N) for me on that dataset?" 

You will now reply: "Sure. NP. ("No Problem" in my generation...)" :)

{Quiz: What is: σ = √(Σ(X-μ)²/N)? 
A. Standard deviation of the population
B. Standard error
C. Standard Operating Procedure
D. The name of a Biology Textbook in Athens}

It is Stats Language for "the standard deviation of the population!"

Adjustments:

Now that you have the equation fresh in your mind, let's turn back to the adjustments we need to make when we do not know the population parameters! 

When we do not know the population parameters, we must adjust.

We must adjust because, when we do not have the whole population, we can only estimate. Statisticians have spent years finding ways to make the estimates better!

The first adjustment is (n-1). 

  1. When we take a sample instead of the whole population, we must use (n-1) instead of n!
So, for σ (population standard deviation) we have: √(Σ(X-μ)²/N).


For s (sample standard deviation) we have √(Σ(X-Xbar)²/(n-1)).
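If it helps to see the (n-1) adjustment in action, here is a minimal sketch in Python using the built-in statistics module (the data values are invented purely for illustration):

from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up observations

sigma = pstdev(data)   # divides by N      -> population standard deviation
s = stdev(data)        # divides by (n-1)  -> sample standard deviation
print(round(sigma, 2), round(s, 2))   # 2.0 and about 2.14

Notice that the sample version comes out a little larger; that is the whole point of the adjustment.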

Study up for a moment and then take a quiz!

{Quiz: Matching.

σ, s

AND

√(Σ(X-μ)²/N), √(Σ(X-Xbar)²/(n-1))}

Notice that population parameters are often Greek and sample statistics are often Roman (letters you recognize as an English speaker). 

So, σ (lower-case sigma) is population standard deviation and s is sample standard deviation.

Xbar is the mean of a sample, but μ (myu, moo) is the population mean! 

So far, when the population parameters are not known, or you do not have every element of the population:
  • We must divide by (n-1) instead of n when computing variance and standard deviation. 
  • We use the Roman letters we know (s and Xbar) instead of the Greek symbols (σ and μ)!
Now, we are ready to add just a little bit more. Go back to this point:


#1. σ (population standard deviation) = √(Σ(X-μ)²/N).


#2. s (sample standard deviation) = √(Σ(X-Xbar)²/(n-1)).

Which one lets us compute the Z Score for a sample mean as an estimate of the "true" population mean?

#1. almost never exists in practice. Unless your population (the group you want to study) is a school of 200 registered students (or something similar--even a large university registers and keeps track of every student so it has a known population), you may never know the "true" population parameters. This is because you either would have a very hard time locating every member of the population or a very hard time getting information from all of them! (Review previous units on the difficulty--nay, the near impossibility--of surveying the whole nation!)

So, in practice, #1 is usually out. We used it to get us to this point, but from here on out, it will be of little use to us. Good-bye #1! 

#2. Here is a much more practical equation. We take a sample of, say, 300 people, compute standard deviation and all is well--UNLESS YOU WANT TO KNOW SOMETHING ABOUT THE POPULATION (which we do)! This only tells us about our sample!

We need something more! 

We need standard error!

You are becoming so good at Stats language that you can perhaps translate this directly into English. You can use your Stats dictionary for help: 

Standard Error=s/√n

When you are ready, proceed to a quiz:

{QUIZ: Standard Error

Standard Error=s/√n. This means that Standard Error is computed by:
A. Taking the sample standard deviation and dividing it by the square root of the number of observations in the sample. 
B. Taking the sample standard deviation and dividing it by the square root of the number of elements in the population. 
C. Taking the population standard deviation and dividing it by the square root of the number of elements in the population. 
D. Taking some time off after this course...

2 attempts allowed}

s = sample standard deviation. √n = square root of the number of observations in the sample. (Notice n = number in the sample, and N = number in the population). So Standard Error (SE) is computed by simply dividing the sample standard deviation by the square root of the number of observations in the sample. 

Now you try:

{Quiz: Standard error practice

Suppose you take a sample of all Americans. Your sample size is 300. Your sample has mean years of education = 11.3. You compute sample standard deviation and find that it is 4.6. Compute Standard Error (SE). 

A. 0.27
B. 17.32
C. 2.46
D. 8.88

F/B. It is simply sample standard deviation divided by the square root of 300! 4.6/√300.}
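A minimal Python check of that calculation (same numbers as the quiz):

from math import sqrt

s = 4.6        # sample standard deviation (years of education)
n = 300        # sample size

se = s / sqrt(n)
print(round(se, 2))   # 0.27 -- answer A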

Standard Error is a very magical thing! Standard error is technically the standard deviation of the sampling distribution. 

Remember: The sampling distribution is a mostly theoretical distribution of all possible samples of a given size! 

Year after year of teaching statistics, I hear a student say something like, "But, Dr. Oliver, how would I actually find the sampling distribution in practice?" 

YOU WOULDN'T! It doesn't make sense! In most cases in practice, if you have that much time and money (and omniscience), you might as well go collect information for all elements of your population and be done! 

If you like, you can think of this as "Stats Magic". Here is the "magic":

If you take a sample and compute the mean and standard error (using sample standard deviation), you can use standard error to tell you what percent of sample means would theoretically be larger or smaller than your sample mean.  

Stop and think about that for 20 seconds... 

The standard normal table (like the one from Math Is Fun) is no longer a graph of how many observations there are at each point; now it is a graph of how many sample means would fall at a given point. 

When we want to estimate the percent of area under the curve between two sample means in the theoretical sampling distribution, we use Standard Error in the way we used to use standard deviation!  So, Z Score is still the number of standard deviations, but now you must use standard deviation of the theoretical sampling distribution (AKA Standard Error!). 

Often the best way to see it is to try it yourself!

{Quiz: Your first estimate with SE

Suppose you take a sample of all Americans. Your sample size is 300. Your sample has mean years of education = 11.3. You compute sample standard deviation and find that it is 4.6. You also compute Standard Error (SE) and find that it is 0.27. 

What percent of samples will have a sample mean between 11.03 and 11.57?
A. 68.2%
B. 1.2%
C. 0.2%
D. 2.2%}

Just like 68.2% is the percent of the area 1 standard deviation in either direction of the mean,  it is also the area 1 standard error in either direction of the sample mean!
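If you like simulations, here is a minimal sketch of the "Stats Magic" in Python (the pretend population below is invented just so there is something to sample from): draw many samples of 300, record each sample mean, and notice that the standard deviation of those sample means lands close to s/√n.

import random
from statistics import mean, pstdev

random.seed(1)
population = [random.gauss(11.3, 4.6) for _ in range(100000)]   # pretend population of "years of education"

sample_means = []
for _ in range(1000):                        # take 1,000 different samples of 300 people
    sample = random.sample(population, 300)
    sample_means.append(mean(sample))

print(round(pstdev(sample_means), 2))        # close to 4.6/√300, which is about 0.27

In other words, the spread of the sample means behaves just the way the Standard Error formula says it should.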

Let's summarize what we have so far:

  • σ (population standard deviation) = √(Σ(X-μ)²/N)
  • s (sample standard deviation) = √(Σ(X-Xbar)²/(n-1))
  • SE (theoretical sampling distribution standard deviation) = s/√n
If you are working ahead, you may have noticed that, by substitution, SE = √(Σ(X-Xbar)²/(n-1)) / √n

Does that look terrifying? It just means that you compute Standard Error by taking sample standard deviation [√(Σ(X-Xbar)²/(n-1))--your old friend, the columns!] and dividing it by √n!

You are going to look very smart to your friends and colleagues very soon! 

Here is the last section for today (you have shown excellent endurance!).

These modules have alluded several times to the idea that we never know the "true" population parameters. 

This week, we have also learned that our sample mean will probably never match the "true" population mean, but will probably be close to it. 

Think of our bell curve again: 


The middle (dashed green) line is now the "true" population mean! 1Sd is now 1SE, 2Sd is now 2SE and so on...

Knowing this, we can tell how CONFIDENT (<--key word alert!) we are about the location of the "true" population mean. 

This is great stuff! Stats Magic!

Go back to Z Scores of observations in a sample. Suppose we have a classroom of 35 students with an average ACT score of 17.37 and standard deviation of those scores equal to 7.50. 

Now (AND THIS IS WHERE IT GETS GOOD!) suppose someone walked up to you and said, "Mr./Mrs./Ms. Emerging Statistician, I will bet that if I pick a random student from the class, you will not be able to guess their ACT score." 

Being an emerging statistician, you know your best guess is the average score, but you also know that most of the scores are NOT average! Guessing the average score gives you the best chance of being the CLOSEST to any randomly picked score, but it gives you almost 0 chance of picking the exact score! 

Try the "No One is Random" simulator on the next page if you want to see it in action. (Macros must be enabled).

{Next page}

Ok. You will never be able to do it. 

So, having taken Dr. Oliver's class, you get an idea and counter, "Mr./Mrs./Ms. Person That Is Challenging Me To A Bet, I will make you a new bet. We all know that your proposal is a highly unlikely wager, but I know some fancy statistics and I can give you a range that will contain the score of whichever student you draw at random from the classroom."

Blindsided by your statistical knowledge, the person allows it. 

Now, suppose you want a 95% chance of getting it right, what range would you give? 

Pause to think for a minute or two. What range would you give? 

You know the mean is 17.37 and that most observations fall close to the mean. So just take the 95% that are closest to the mean!

Just take the 95% that are closest to the mean! (Yes, that was deliberately repeated for emphasis!)

Notice that 95.4% of cases fall between -2Sd and +2Sd. 

As it turns out, 95% of cases fall between -1.96Sd and +1.96Sd. You can see this yourself by using the Z Score calculator and finding 47.5% of the area between 0 and Z. (47.5+47.5 adds up to 95!). 

To do this, we multiply s by 1.96 and -1.96, then add each to the mean:

  • 1.96*7.5= 14.7
  • -1.96*7.5= -14.7
Now add (subtract) each to (from) the mean of 17.37:
  • 17.37+14.7=32.07
  • 17.37-14.7=2.67
Now, you can say that you are 95% CONFIDENT (<--key word alert!) that the student to be drawn at random will have an ACT score between 2.67 and 32.07. 
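Here is that arithmetic as a few lines of Python, in case you want to check it (a minimal sketch using the classroom numbers above):

mean, sd = 17.37, 7.50    # classroom ACT scores: mean and standard deviation

low = mean - 1.96 * sd
high = mean + 1.96 * sd
print(round(low, 2), round(high, 2))   # 2.67 and 32.07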

This same idea works for sample means when we want to capture the "true" population mean. 

Suppose now that you want to know the "true" mean ACT score for all Americans. You take a random sample of 150 people and ask their ACT score. The sample has an average of 18.59 and standard deviation of 6.33. Use standard error (instead of sample standard deviation) to find the 95% CONFIDENCE INTERVAL (the two numbers that are the lower and upper bound of the middle 95% of the distribution). 

First, find standard error by dividing standard deviation by √n:

6.33/√150=6.33/12.25=0.52=SE

Now we need to find 1.96 Standard Errors in either direction from the mean.

1.96*0.52=1.01
-1.96*0.52= -1.01

Then add/subtract to/from the mean:

18.59+1.01=19.60
18.59-1.01=17.58

So, we are 95% confident that the interval between 17.58 and 19.60 contains the "true" population mean!

We call this a 95% Confidence Interval.

It also means that we were not willing to accept greater than a 5% chance of being wrong...
We call that our ALPHA level. It is represented by the Greek alpha= α
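Here is the confidence-interval version in Python, swapping standard deviation for Standard Error (a minimal sketch using the national ACT sample above):

from math import sqrt

xbar, s, n = 18.59, 6.33, 150   # sample mean, sample standard deviation, sample size

se = s / sqrt(n)                 # about 0.52
low = xbar - 1.96 * se
high = xbar + 1.96 * se
print(round(low, 2), round(high, 2))   # about 17.58 and 19.60 -- the 95% Confidence Interval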


Conclusions!

This was a big unit this week. You have learned:
  • σ (population standard deviation) = √(Σ(X-μ)²/N)
    • Used when you have all elements of the population (RARE!)
  • s (sample standard deviation) = √(Σ(X-Xbar)²/(n-1))
    • Used when you want to compute Z Scores for observations in a sample
  • SE (theoretical sampling distribution standard deviation) = s/√n
    • Used when you want to compute Z Scores and Confidence Intervals for sample means relative to the sampling distribution
  • 1.96 "Standard Errors" from the mean in either direction gives us a 95% Confidence Interval.
  • ALPHA level (represented by the Greek alpha= α) reflects the greatest chance we are willing to take of being wrong! (We will refine that wording a little in future units!)
We will see these concepts again, and you will get more practice with them until you are very familiar with them!

Good luck as you continue on the adventurous path of statistics!