The Normal Distribution
Data often follow particular patterns, and some of these patterns appear consistently across many settings. We call these patterns distributions. One of the most famous distributions is the normal distribution, sometimes called the bell curve. A closely related distribution, Student's t distribution, was discovered by Gosset while working for Guinness. The normal distribution is used in many fields, including physics, finance, biology, psychology, and marketing. We will now explore what the normal distribution is and how to interpret it.
What is the normal distribution?
The normal distribution looks like this
Notice how it looks like a bell. That is why it is often called the bell curve. The normal distribution has two important numbers, or more specifically parameters, that describe its shape: the mean and the standard deviation. The peak of the normal distribution is always centered at the mean, and we typically describe the other parts of the curve by their distance from the mean. We often write down a given normal distribution in a shorthand notation in the following way: X ~ N(μ, σ²), where μ is the mean and σ² is the variance, the square of the standard deviation.
For example, a normal distribution with a mean of 10 and a standard deviation of 3 would be written as N(10, 9), since the variance is 3² = 9.
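To see this notation in action, we can simulate draws from N(10, 9) in R. Note that rnorm() asks for the standard deviation, not the variance:

set.seed(1)
draws <- rnorm(100000, mean = 10, sd = 3)   ## 100,000 draws from N(10, 9)
mean(draws)   ## approximately 10
sd(draws)     ## approximately 3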
The normal distribution has the property that the mean, median, and mode are all the same value. As stated in Measures of Central Tendency, skewed distributions exist in which the mean and median are not equal to one another. Additionally, the normal distribution is symmetric about the mean: the curve falls off in the same way, at the same rate, on both sides of the mean. This will be helpful in the problems in the next section, so it is significant that the normal distribution has these properties.
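A quick numerical sketch of these properties, using the N(10, 9) distribution from the example above:

## the curve has the same height at equal distances on either side of the mean
dnorm(10 - 3, mean = 10, sd = 3)
dnorm(10 + 3, mean = 10, sd = 3)
## for a large simulated sample, the mean and median are nearly identical
set.seed(1)
x <- rnorm(100000, mean = 10, sd = 3)
mean(x)
median(x)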
Probability and the normal distribution
The normal distribution is often interpreted by means of probability and the area underneath the curve: the area underneath the curve over an interval corresponds to the probability of observing a value in that interval. Additionally, it can be shown that
- 68% of the observations fall within 1 standard deviation of the mean
- 95% of the observations fall within 2 standard deviations of the mean
- 99.7%, or nearly all, of the observations fall within 3 standard deviations of the mean
We call these properties the Empirical Rule or the 68-95-99.7 Rule. We provide an image of this below:
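These percentages can also be checked in R. The function pnorm() gives the area under the standard normal curve to the left of a point, so subtracting two such areas gives the area in between:

pnorm(1) - pnorm(-1)   ## about 0.68
pnorm(2) - pnorm(-2)   ## about 0.95
pnorm(3) - pnorm(-3)   ## about 0.997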
Example I
One might ask the following: what is the probability of observing a value from a normal distribution within 1 standard deviation of the mean?
It is important to first understand what probability means in this context. At its most basic, probability is chance. For instance, out of a standard deck of cards with equal numbers of clubs, spades, hearts, and diamonds, what is the chance of drawing a heart? The answer is 25%, as there are four suits in equal numbers.
Back to the original question: the chance of observing a value from a normal distribution within 1 standard deviation of the mean is 68%. We know that this is true because of the Empirical Rule.
Question I
What is the probability that an observation from a normal distribution will fall within 2 standard deviations of the mean?
Answer I
By the Empirical Rule, the chance that an observation will fall within 2 standard deviations of the mean is 95%, or 0.95.
Question II
We provide a normal distribution with a mean of 13 and a standard deviation of 2 in Figure 2.
What is the probability that an observation will fall between 11 and 15? Draw a picture of your answer.
Answer II
By the Empirical Rule, the chance that an observation will fall between 11 and 15 is 68%. We know this because 11 to 15 is the interval within one standard deviation of the mean for this normal distribution. A drawing of the answer, produced in R, is shown below. The code is provided below the figure.
## empirical rule figure ##
x <- c(seq(5, 21, length = 2000), 5, 7, 9, 11, 13, 15, 17, 19, 21)
x <- sort(x)
y <- dnorm(x, mean = 13, sd = 2)
plot(x, y, type = "l", xaxt = "n", xlab = "Standard Deviation",
     ylab = "Probability", main = "N(13,4)")
axis(1, at = c(5, 7, 9, 11, 13, 15, 17, 19, 21))
## coloring in the region of interest ##
x <- seq(11, 15, length = 10000)
y <- dnorm(x, mean = 13, sd = 2)
polygon(c(11, x, 15), c(0, y, 0), col = "navy")
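As a quick check on this answer, pnorm() can compute the same area exactly by supplying the mean and standard deviation:

pnorm(15, mean = 13, sd = 2) - pnorm(11, mean = 13, sd = 2)   ## about 0.68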
Question III
Using the same normal distribution as in the question above, what is the chance that an observation will fall between 13 and 17? Draw a picture.
Answer III
For this question, we will be relying on the symmetric property of the normal distribution and the Empirical Rule. We know that 95% of observations fall within 2 standard deviations of the mean by the Empirical Rule. For our case, this interval will be between 9 and 17. We provide an image of this below.
We stated previously that an important property of the normal distribution is that it is symmetric about the mean. Therefore, if we split that interval in half at the mean, each half contains half of those observations. This gives 95% / 2 = 47.5% of the observations between 13 and 17. See the figure below for a visual representation. Below that is the code to reproduce the image in R.
## Answer III figure ##
x <- c(seq(5, 21, length = 2000), 5, 7, 9, 11, 13, 15, 17, 19, 21)
x <- sort(x)
y <- dnorm(x, mean = 13, sd = 2)
plot(x, y, type = "l", xaxt = "n", xlab = "Standard Deviation",
     ylab = "Probability", main = "N(13,4)")
axis(1, at = c(5, 7, 9, 11, 13, 15, 17, 19, 21))
## coloring in the region of interest ##
x <- seq(13, 17, length = 10000)
y <- dnorm(x, mean = 13, sd = 2)
polygon(c(13, x, 17), c(0, y, 0), col = "navy")
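The symmetry argument can also be checked numerically: the area from the mean out to 17 is exactly half of the area between 9 and 17:

pnorm(17, mean = 13, sd = 2) - pnorm(13, mean = 13, sd = 2)        ## about 0.477
(pnorm(17, mean = 13, sd = 2) - pnorm(9, mean = 13, sd = 2)) / 2   ## the same value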
Question IV
Using the same normal distribution, what is the chance an observation will be between 7 and 9? Draw a picture.
Answer IV
We know that 95% of observations fall within 2 standard deviations of the mean by the Empirical Rule. For our case, this interval will be between 9 and 17. We provide an image of this below.
We know that 99.7% of observations fall within 3 standard deviations of the mean by the Empirical Rule. For our case, this interval will be between 7 and 19. We provide an image of this below.
By the symmetry of the normal distribution, we recognize that 47.5% of observations fall between 9 and 13, and 49.85% fall between 7 and 13. Since we are only interested in the area under the curve from 7 to 9, we are not concerned about the area from 9 to 13. Therefore, if we subtract the two areas from one another, we end up with 49.85% - 47.5% = 2.35% between 7 and 9. Therefore, the chance that an observation falls between 7 and 9 is 2.35%. We provide a figure of this and the code to produce it below.
## Answer IV figure ##
x <- c(seq(5, 21, length = 2000), 5, 7, 9, 11, 13, 15, 17, 19, 21)
x <- sort(x)
y <- dnorm(x, mean = 13, sd = 2)
plot(x, y, type = "l", xaxt = "n", xlab = "Standard Deviation",
     ylab = "Probability", main = "N(13,4)")
axis(1, at = c(5, 7, 9, 11, 13, 15, 17, 19, 21))
## coloring in the region of interest ##
x <- seq(7, 9, length = 10000)
y <- dnorm(x, mean = 13, sd = 2)
polygon(c(7, x, 9), c(0, y, 0), col = "lightblue")
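The same subtraction can be carried out with pnorm(). The exact area is slightly smaller than 2.35% because the Empirical Rule rounds 95.45% and 99.73% down to 95% and 99.7%:

pnorm(9, mean = 13, sd = 2) - pnorm(7, mean = 13, sd = 2)   ## about 0.021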
NOTE: If you are a statistician/mathematician/etc., it will be important to be able to write this statement down mathematically. It is not difficult; it is just notation you might not have seen before. This may or may not be necessary for you to know this instant; however, it will be helpful to at least see it. To write down the chance, or probability, that an observation falls between 7 and 9, we write the following: P(7 < X < 9), where X is an observation from the N(13, 4) distribution.
The Central Limit Theorem
The Central Limit Theorem (CLT) is a fundamental theorem in statistics. It is stated below.
Central Limit Theorem – the mean of a sufficiently large number of independent and identically distributed observations will be approximately normally distributed, no matter what distribution those observations come from.
In other words, if we take many simple random samples and compute the mean of each, those sample means will be approximately normally distributed. Believe it or not, this is a powerful and important theorem. It is not often that we are able to know the distribution of a statistic with only a modest number of observations.
Also, it is important to know that the standard deviation of a statistic such as the sample mean is often called the standard error.
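A short simulation illustrates both ideas: sample means computed from a clearly non-normal distribution still look roughly normal, and their standard deviation matches the theoretical standard error:

set.seed(42)
## 1000 sample means, each from a sample of 40 draws from an exponential distribution
sample.means <- replicate(1000, mean(rexp(40, rate = 1)))
hist(sample.means, main = "Means of exponential samples", xlab = "Sample mean")
sd(sample.means)   ## close to the theoretical standard error
1 / sqrt(40)       ## the theoretical standard error: sigma / sqrt(n) = 1 / sqrt(40)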
Question V
Suppose that chemists are measuring a special statistic, called the B statistic, that is used in producing beer. Assume that beers with similar values of the B statistic have similar tastes, smells, and smoothness. This is important, as you would want to produce beer with a consistent overall experience.
Suppose that brewers at Billy’s Beirhause are brewing an experimental beer, and they have no idea what the B statistic will be for this beer. Each of the following 30 means was computed from 20 simple random samples taken from one of the 30 barrels of the experimental beer.
Barrel | Mean |
1 | 4.0046 |
2 | 3.9944 |
3 | 3.9904 |
4 | 4.0079 |
5 | 4.0122 |
6 | 3.9862 |
7 | 3.9976 |
8 | 3.9959 |
9 | 3.9847 |
10 | 3.9886 |
11 | 4.0078 |
12 | 3.9700 |
13 | 4.0078 |
14 | 3.9951 |
15 | 4.0250 |
16 | 4.0094 |
17 | 4.0109 |
18 | 4.0287 |
19 | 3.9857 |
20 | 3.9936 |
21 | 3.9864 |
22 | 4.0037 |
23 | 4.0143 |
24 | 3.9829 |
25 | 3.9861 |
26 | 4.0048 |
27 | 3.9903 |
28 | 4.0066 |
29 | 4.0059 |
30 | 3.9938 |
What can we say about the distribution of the B statistic for the new experimental beer? What is the special name for this distribution's standard deviation?
Answer V
Since each barrel mean comes from a simple random sample and the statistic we are looking at is the mean of B for each sample, the CLT applies here. Therefore, we will find the mean of the means and the standard error.
Using the equation of the sample mean, we found that the mean of the means was approximately 3.999.
Using the equation of the sample standard deviation, we found that the standard error was approximately 0.0131.
We provide the code to perform this in R below.
## putting the data from the table into a vector
vec <- c(4.0046, 3.9944, 3.9904, 4.0079, 4.0122, 3.9862, 3.9976, 3.9959, 3.9847, 3.9886,
         4.0078, 3.9700, 4.0078, 3.9951, 4.0250, 4.0094, 4.0109, 4.0287, 3.9857, 3.9936,
         3.9864, 4.0037, 4.0143, 3.9829, 3.9861, 4.0048, 3.9903, 4.0066, 4.0059, 3.9938)
SUM <- sum(vec)
SUM / 30     ## the mean of the means using the formula
mean(vec)    ## the same result using R's mean() function
sd(vec)      ## the standard deviation of the means, i.e. the standard error
## an equivalent way to calculate the standard deviation:
SErr <- (vec - mean(vec))^2
sqrt(sum(SErr) / 29)
QQ Plot
QQ plots are used to assess the normality of a set of data; in other words, they are used to see whether the data are normally distributed. We will not discuss how to do this with paper and pencil or with a given formula, as that is beyond the scope of this lesson. However, it is important to check whether your data are normally distributed, because we cannot apply many of the lessons to come if we do not know whether the data are normally distributed. Therefore, we will use R to do this for us.
In R, the command qqnorm() will produce a QQ plot for us. If a sample is perfectly normally distributed, the points will fall along a straight line, as shown below. Notice that the ends do have a slight imperfection. This is okay, and these data points would still be considered normally distributed.
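The code below produces such a plot for simulated normal data; qqline() adds a reference line that makes it easier to judge straightness:

set.seed(7)
x <- rnorm(200)   ## data drawn from a normal distribution
qqnorm(x)         ## points should fall close to a straight line
qqline(x)         ## adds a reference line for comparison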
Plots that stray too far from a straight line are not considered normally distributed. However, you must make that determination yourself; there is no exact cutoff. The Shapiro-Wilk test provides a single number that is more easily interpretable, but it is beyond the scope of this introduction to statistics.
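Although we do not cover the test itself, calling it in R is simple; a small p-value suggests the data are not normally distributed:

set.seed(7)
shapiro.test(rnorm(200))   ## normal data: expect a large p-value
shapiro.test(rexp(200))    ## skewed data: expect a very small p-value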