Standardization and z Scores
We sometimes hear that problems are relative: two problems can only be meaningfully compared when we account for the context each one comes from. For example, stealing gum and cheating on a final exam are both considered wrong. However, one could argue that stealing gum is rather minor within the realm of stealing, while cheating on a final is rather severe within the realm of cheating. But how do we compare the severity of these two actions? In statistics, we make such comparisons by means of standardization. For normal distributions, one of the best tools we have for this kind of analysis is the z score. This is not the only use of z scores; we will describe other uses as we continue our discussion.
Standardization
Standardization is the process of transforming sets of data so that they can be compared to one another. Mathematicians and statisticians might refer to standardization as a transformation. Standardization is the first step in comparing data: if we wanted to compare observations from two different sets of data, we would first need to standardize both sets. For example, if we wanted to compare a cell from a data set believed to come from brain cancer cells with a cell from a data set believed to come from prostate cancer cells, we would first need to standardize the data.
For introductory purposes, we first want the data to be mean centered. This means that we want each observation to be expressed relative to the mean of the data. To do this, we simply subtract the mean from each observation.
Next, we divide each difference by the standard deviation of the data of interest. This two-step process is the entirety of standardization. While there are other ways to standardize data, we will be primarily concerned with this method for the time being. We have summarized the standardization process below.
Process
1. Subtract the mean from each observation.
2. Divide the difference by the standard deviation.
This can be summarized in one step as the following:

z = (x - μ) / σ

where x is an observation, μ is the mean, and σ is the standard deviation of the data.
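To make this concrete, below is a minimal sketch of the procedure in R. The vector obs and its values are made up purely for illustration.

#a small, made-up data set for illustration
obs <- c(15.4, 13.7, 12.6, 10.9, 11.6)
centered <- obs - mean(obs) #step 1: subtract the mean from each observation
standardized <- centered / sd(obs) #step 2: divide each difference by the standard deviation; note that sd() uses the sample standard deviation
standardized #display the standardized values
scale(obs) #R's built-in scale() performs the same two steps at once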
Z Scores
A z score is simply the standardized value of an observation, under the assumption that the data is normally distributed. Specifically, z scores come from a standard normal distribution. In other words, a z score comes from a normal distribution with a mean of 0 and a standard deviation of 1. All normal distributions can be converted from their original values to the standard normal distribution. Z scores can be helpful when we want to compare extreme values from different distributions that take on greatly different values.
To calculate a z score for an observation, use the following formula:

z = (x - μ) / σ

where x is the observation, μ is the mean, and σ is the standard deviation.
Notice that this is the exact same standardization procedure as stated above (seriously, I just copied and pasted it). To standardize all of your observations, simply apply this formula to each data point.
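For instance, here is a quick sketch in R of the formula applied to a single observation; the observation, mean, and standard deviation below are made-up values for illustration.

x <- 127 #a hypothetical observation
mu <- 100 #the assumed mean
sigma <- 15 #the assumed standard deviation
z <- (x - mu) / sigma #the z score formula
z #1.8: the observation is 1.8 standard deviations above the mean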
Let’s go over some examples to get a better understanding of how this works.
Example I
Suppose that I have a population with the following values. The mean of the population is 13, and the standard deviation is 2. Assume that it comes from a normal distribution. Calculate the z score of each of the data points. Draw a picture where each z score is placed on a standard normal distribution.
Table 1

Observation | Value
1 | 15.39654
2 | 13.74751
3 | 12.59516
4 | 10.87089
5 | 11.63320
Answer I
The z scores were calculated using the following:

z = (x - 13) / 2
Using the above formula, the first observation’s z score was

z = (15.39654 - 13) / 2 ≈ 1.198
The second observation’s z score was

z = (13.74751 - 13) / 2 ≈ 0.374
Similarly, the rest of the z scores for the third, fourth, and fifth observations were

z = (12.59516 - 13) / 2 ≈ -0.202
z = (10.87089 - 13) / 2 ≈ -1.065
z = (11.63320 - 13) / 2 ≈ -0.683
A picture of the standard normal distribution with these values is shown below. Some of the z scores are written in the unorthodox position above the x axis; this was done for clarity. Additionally, the z scores were rounded for simplicity.
The code to do this in R is shown below, followed by its output in R. The data and code can be found here.
#calculating z scores
load('Standardization and z Scores Data.RData') #loading data into R
ls() #seeing objects from .RData file
example_i #displaying object to confirm that it is the correct object
zscores <- (example_i - 13) / 2 #using z score formula to calculate all z scores; saves all into a new vector
zscores #display z score values

#drawing picture with z scores
x <- c(seq(-3, 3, length = 50), 1.1982683, 0.3737539, -0.2024195, -1.0645557, -0.6834012) #create vector of values from -3 to 3, including the z score values
x <- sort(x) #sorting values
y <- dnorm(x, mean = 0, sd = 1) #creating values for the normal distribution pdf line
plot(x, y, type = "l", xaxt = "n", xlab = "z Scores", ylab = "Probability", main = "Example 1's z Scores") #plotting picture; creating labels; removing default x axis tick marks
axis(1, at = c(1.20, -0.20, -1.06)) #adding 3 z scores below the x axis
axis(3, pos = .0001, at = c(0.37, -0.68)) #adding the remaining 2 above the x axis
p Values
Sometimes it can be difficult to interpret a z score or another statistic. Over time, one may develop an intuition for what these statistics mean, but a p value often helps at first. A p value is the probability of observing a given value or one more extreme. For example, if we had a normal distribution with a mean of 0 and a standard deviation of 1 and we observed a value of 3, the p value tells us the probability of observing 3 or something more extreme. There are two possible circumstances for which values we would consider. The first is when we only consider values equal to or greater than 3, such as 4, 5.3, 8, and so on. The second is when we consider values equal to or greater than 3 but also all of those values multiplied by negative 1; that is, we would also consider values equal to or less than -3, such as -3.7, -5, -13, and so on. Thus, the second case would consider values such as 4, 6, -7, -13, 25, 3, and so on. In the next section, we will go over when to consider each case.
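As a rough sketch of how these two cases could be computed in R, using the built-in pnorm() function for the standard normal distribution (the observed value of 3 is the one from the example above):

z <- 3 #the observed value from the example above
p_first <- 1 - pnorm(z) #first case: probability of values equal to or greater than 3
p_first #approximately 0.00135
p_second <- 2 * (1 - pnorm(abs(z))) #second case: probability of values at or beyond 3 in either direction
p_second #approximately 0.0027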