Applied Statistics Tutorial:
Overview:
This tutorial is an outline of statistical concepts and their relevant terms. We first express and clarify basic statistical terms and phrases, then identify the purpose of studying statistics. Our objective is to take an applied approach to statistics, although we give a glimpse of basic statistics in the first chapter to help you form a general conception of the subject. We have divided this tutorial into three chapters as follows:
- Introduction to Statistics
- Descriptive Statistics
- Inferential Statistics
Introduction to Statistics:
Suppose you wish to make an informed decision. It could be about a restaurant that provides high-quality food and atmosphere, a place for a vacation, a good school for your children, or a reliable brand of dietary supplement. You might discuss the decision with friends, family, acquaintances, or even people you do not know. You might search online or read reviews to get more information about the subject. Finally, you organize and process the information you have obtained, and then you make the decision. Indeed, without knowing it, you have conducted a small statistical survey and applied other people's information to make a decision.
Statistics is used in many aspects of life and has a wide spectrum of applications. It is a significant science with a meticulous methodology that is used in almost all applied sciences, such as medicine, psychology, economics, and actuarial science. Statistics Canada (StatCan) is Canada's national statistical agency and a trusted source of information that conducts a census every five years. According to StatCan, "Our mission: Serving Canada with high-quality statistical information that matters." (https://www.statcan.gc.ca/eng/about/about?MM=as)
In daily reports or news about politics, health, education, the economy, and so on, you might encounter a statistical sentence or conclusion. For example, "Peanut allergy in Canada affects about 2 in 100 children." (http://foodallergycanada.ca/) or
“People with cystic fibrosis are living to a median age of 50.9 years in Canada, compared with 40.6 years in the U.S., according to research published in the Annals of Internal Medicine.” (http://www.cbc.ca/news/health/cystic-fibrosis-survival-rates-1.4022970)
The science of statistics always deals with numbers or figures, which play an essential role in any statistical project. Statisticians make decisions and predictions based on evidence called data, so the underlying basis of statistics is data. Data are information represented by numbers or words that expand our knowledge so that we can make a decision or interpret a problem. Statistics shows how to collect and organize data, and how to analyze and interpret them with a meticulous methodology.
Collecting Data
In the beginning, we have raw data which is collected using different methods. These methods are as follows:
Data collection methods
- Survey: a survey can be conducted in either a population (census survey) or a sample (sample survey) through an interview, a satisfaction survey, or a questionnaire.
- Observation: an observational study is used to merely observe individuals and record data, with no interference or control.
- Experiment: an experimental study is also used to record data; however, in this study the individuals and treatments are controlled by a researcher.
The raw collected data are not particularly useful by themselves, but the science of statistics enables us to draw useful information from them and understand how they behave. We then apply this information to interpret the data for purposes such as making decisions, testing hypotheses, and making predictions. Indeed, we apply statistics to legitimize a claim, test the efficiency of a new product, or make a prediction. Statistics is usually divided into two branches:
- Descriptive Statistics, which involves collecting data, organizing them in graphs or tables, and summarizing them into numbers (Mean, Median, Mode).
- Inferential Statistics, which applies descriptive statistics in order to interpret data or make predictions from them.
Collecting the data, which is the first step in the procedure, requires us to specify the boundaries of the research, which indicate where the data come from. For each statistical research project, there is a population, which embraces the collection of all possible individuals who might be studied. Suppose we are interested in studying diabetes in Canada. It is obvious that the population is the people who live in Canada. However, in the following report, the population is different:
"In 2014, 6.7% of Canadians aged 12 or older (2.0 million people) reported that they had diabetes." (http://www.statcan.gc.ca/pub/82-625-x/2015001/article/14180-eng.htm) In this report, the population is Canadians who were 12 years or older in 2014. It is usually not practical to study the entire population (called a census) due to cost, time, and the errors that come with a large amount of data. Therefore, we specify a sample, which is a group of individuals (a subset of the population) selected from the population, and study or observe all the members of that sample to produce the data. The sizes of the population and the sample are denoted by "N" and "n" respectively.
We are interested in one or more distinctive characteristics, or features, of the individuals in the population, called variables. The dictionary definition of "variable" is something that is "inconsistent or not having a fixed pattern". A variable is a factor that is changeable and takes different values. We define a variable as a distinctive characteristic, feature, number, or quantity that can be measured or counted and that assumes different values from one observation or individual to another, such as age, sex, family size, or income.
With variables in mind, we define data as the values of the variables that we obtain from measurement or observation. In order to measure variables, we apply different scales of measurement based on the nature of the variables. Some variables are not inherently numeric, such as sex or level of pain. Others are inherently numeric and are measured precisely with specific tools, such as weight or temperature. We have four scales of measurement, as follows:
- Nominal
This scale is used merely to name the different levels of a variable in order to differentiate between them. These names could be numbers, words, or letters. If we assign numbers to the levels of the variable, the numbers have no meaning, superiority, or order. That means we do not consider the magnitude of the numbers. For example, "Sex" has two levels: "Female" and "Male". If we assign the number "0" to males and the number "1" to females, these numbers are applied only to distinguish whether an observation is male or female, and nothing more.
- Ordinal
This scale is used when the levels of a variable have a logical order or discipline. The order between the levels is intuitive. For example, level of pain has the levels no pain, mild pain, moderate pain, and severe pain. We might assign the numbers 1 to 4 to these levels respectively. As the number increases, the level of pain increases, and vice versa. Therefore, these numbers are applied to identify the levels and have order. However, the intervals between the levels cannot be quantified, and basic mathematical operations on these numbers are not appropriate.
- Interval
This scale measures variables that are inherently numeric; hence the data generated from these variables are numbers. Usually we use specific tools to measure these variables. The values of the numbers are meaningful, and the intervals between the values are measurable and equal for any two consecutive (adjacent) numbers. Examples are the Fahrenheit and Celsius temperature scales, and time. The interval between 60 and 61 degrees is equal to the interval between 65 and 66. In addition, the interval between 65 and 75 degrees is 10 degrees. However, interval scales do not have a meaningful or true zero, which means the number zero does not indicate the absence of the characteristic: zero degrees does not mean there is no heat. Multiplication and division are not appropriate on this scale of measurement.
- Ratio
This scale is the perfect scale and has all the properties of the scales mentioned above. In addition, it has true zero, which indicates the absence of the characteristic. Basic mathematical operations are appropriate on the numbers when this scale is applied. For example, weight, height, and number of children.
Suppose that in a research study there is a sample of seven people whom we wish to study with respect to their sex, family size, and weight. The results of the survey are collected in the following table:
ID | Sex | Family Size | Weight |
1 | Male | 1 | 147.3 |
2 | Male | 2 | 112.5 |
3 | Female | 3 | 168 |
4 | Male | 4 | 153 |
5 | Female | 1 | 170.8 |
6 | Female | 4 | 138 |
7 | Female | 5 | 154 |
In this survey, the variables are sex, family size, and weight, with different values for each observation. For the sex variable, the data obtained are words (male or female), as opposed to family size and weight, which are numbers. For the family size variable, the data are whole numbers, as opposed to weight, which can be a decimal. Therefore, there are two categories of variables, called qualitative and quantitative variables; we can also use the terms categorical and numerical respectively.
Qualitative (Categorical) variables are measured on nominal or ordinal scales and can be divided into distinct categories based on the corresponding scales. Therefore, these variables can be further categorized as nominal or ordinal variables. For example, sex is a qualitative nominal variable.
The nominal variables with two levels such as sex are called dichotomous variables.
Quantitative (Numerical) Variables are measured on interval or ratio scales, such as height, family size, and age. Quantitative variables can be classified into two groups: continuous and discrete variables.
Discrete Variables take a countable number of values; for example, the number of children in a family. These variables take finitely many values between any two values and cannot take values between two consecutive values. That means the data generated by these variables are whole or integer numbers.
Continuous Variables can take an infinite number of values between any two specific values, for example, height or weight.
To run any statistical research project, we need to specify our population and the variable that we are interested in testing. After that, we assess the size of the population (denoted by N). If it is so large that we are not able to measure the whole population, we choose a sample (of size n) from the population and measure the variables to generate data. We then try to estimate characteristics of the population using the sample.
Suppose we want to compare the serum cholesterol levels of fifty-year-old females and males in Canada. In this study, the population is all fifty-year-old people who reside in Canada. The variables being studied are serum cholesterol level, which is a continuous quantitative variable, and sex, which is a dichotomous qualitative variable. However, we are usually not able to study the whole population due to time constraints, high expenses, ethics, and so on. Therefore, we try to find the best sample in terms of size, efficiency, available time, and expense. That sample should be a true representation of the entire population, enabling us to generalize the results. Suppose we use 80 people as a sample. After measuring the sample, we have 80 numbers representing the serum cholesterol levels of the sample (called a dataset) and 80 words (female or male) representing their sexes. We can organize these data in a table or graph that gives us more information about the values of the data, the number of occurrences of each data point, and how the data are distributed.
Frequency Distributions and Graphs
After gathering data for a specific variable, the next step is organizing the raw data by constructing a frequency distribution or by generating graphs. Sometimes it is more convenient to represent the data visually with graphs, which enable us to grasp the distribution or pattern of the data instantly. We demonstrate the distribution of data by generating statistical charts and graphs such as histograms, pie charts, stem-and-leaf plots, and frequency tables.
The two most commonly used types of frequency distributions are the categorical and the grouped frequency distribution.
Frequency distribution table:
When the range of the data is large, the data must be grouped into classes that are more than one unit in width. We are going to construct a frequency table for the data and classify them into classes.
For each class, we have lower and upper class limits, which are where the class starts and ends.
The class width is the difference between two consecutive lower class limits or upper class limits. We then determine the number of classes.
It is preferable to have between 5 and 20 classes (not too few nor too many). It is recommended to take the smallest value of "k" satisfying 2^k > n, where "k" is the number of classes and "n" is the sample size, as the small sketch below illustrates. In what follows, we use the letters "H" for the highest value, "L" for the lowest value, and "R" for the range.
H = Highest value, L = Lowest value, R = H - L (the range), Width = R / k
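For readers who want to check these calculations outside SPSS, here is a minimal Python sketch (our own illustration, not an SPSS feature); the values of H and L are taken from example 1 below.

```python
import math

def num_classes(n: int) -> int:
    """Return the smallest k with 2**k > n (the recommended class-count rule)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

n = 50          # sample size (example 1 below)
H, L = 14, 0    # highest and lowest values in the data
R = H - L       # range
k = num_classes(n)
width = math.ceil(R / k)  # round up; discrete data need a whole-number width

print(k, R, width)  # 6 14 3
```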
Example 1: The following data represents the number of children in a random sample of 50 rural Canadian families.
(Reference: American Journal Of Sociology, Vol. 53, 470-480)
11 | 2 | 9 | 3 | 9 | 6 | 9 | 5 | 14 | 5 |
13 | 5 | 2 | 3 | 4 | 0 | 5 | 2 | 7 | 3 |
4 | 0 | 5 | 4 | 3 | 2 | 4 | 2 | 6 | 4 |
14 | 0 | 2 | 7 | 3 | 6 | 3 | 3 | 6 | 6 |
10 | 3 | 3 | 1 | 2 | 5 | 2 | 5 | 2 | 1 |
The table above is not informative, so we organize the data into a frequency distribution table with the following SPSS procedure.
First, we need to find "R" (the range), which can be computed by SPSS with the following procedure:
- Define your variables in Variable View window.
- Enter your data in the Data View window.
- Analyze → Descriptive Statistics → Descriptives (move the variable into the right box) → Option (check Range, Maximum, and Minimum) → Continue → Ok
Output:
Descriptive Statistics | ||||
N | Range | Minimum | Maximum | |
Number of children | 50 | 14.00 | .00 | 14.00 |
Valid N (listwise) | 50 |
With the recommended rule for "k" (2^k > n), the smallest "k" with 2^k > 50 is 6 (2^6 = 64 > 50). Therefore we have R = 14, n = 50, and k = 6, so width = 14/6 ≈ 2.3, which we round up to width = 3 (since the data are discrete, we cannot use decimal numbers). Afterwards, we use the following SPSS procedure to build the frequency table:
- Transform → Visual Binning (move your variable to "Variables to Bin") → Continue → (choose a new name for your variable in "Binned Variable"; in "Upper Endpoints", choose your option) → Make Cutpoints (First Cutpoint Location = L or L-1, Number of Cutpoints = number of classes; the Width is filled automatically) → Apply → Make Labels → Ok
- Analyze →Descriptive Statistics →Frequencies (move the new variable into Variables box) → Charts→ Histogram → Continue.
(in this example, the First Cutpoint Location=-1, Number of Cutpoints=6, and Width=3)
Output:
Number of children (Binned) | |||||
Frequency | Percent | Valid Percent | Cumulative Percent | ||
Valid | .00 – 2.00 | 14 | 28.0 | 28.0 | 28.0 |
3.00 – 5.00 | 21 | 42.0 | 42.0 | 70.0 | |
6.00 – 8.00 | 7 | 14.0 | 14.0 | 84.0 | |
9.00 – 11.00 | 5 | 10.0 | 10.0 | 94.0 | |
12.00 – 14.00 | 3 | 6.0 | 6.0 | 100.0 | |
Total | 50 | 100.0 | 100.0 |
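As a cross-check of the SPSS output, here is a minimal pandas sketch (our own illustration, assuming pandas is available) that reproduces the binned frequency table above.

```python
import pandas as pd

children = [11, 2, 9, 3, 9, 6, 9, 5, 14, 5,
            13, 5, 2, 3, 4, 0, 5, 2, 7, 3,
            4, 0, 5, 4, 3, 2, 4, 2, 6, 4,
            14, 0, 2, 7, 3, 6, 3, 3, 6, 6,
            10, 3, 3, 1, 2, 5, 2, 5, 2, 1]

s = pd.Series(children, name="children")
# Bin edges matching the SPSS cutpoints: classes 0-2, 3-5, 6-8, 9-11, 12-14.
binned = pd.cut(s, bins=[-1, 2, 5, 8, 11, 14],
                labels=["0-2", "3-5", "6-8", "9-11", "12-14"])

freq = binned.value_counts().sort_index()
percent = 100 * freq / len(s)
print(pd.DataFrame({"Frequency": freq,
                    "Percent": percent,
                    "Cumulative Percent": percent.cumsum()}))
```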
The table above is more informative and gives us information about the data. For example, it enables us to state that 6% of the families have 12 or more children. We can also present the data in visual or graphical form, which gives us a quick idea of the pattern of the data. The following are some of the most commonly used statistical graphs:
- Bar chart: mostly used for categorical data.
- Pie chart: mostly used for categorical data, to show the percentage or proportion of each category.
- Histogram: mostly used for quantitative continuous data.
- Stem-and-leaf plot: used to show how the data are spread.
- Frequency polygon: a line graph used to show the shape of the distribution of the data, to diagnose skewness and kurtosis.
- Ogive (cumulative frequency graph): a line graph used to represent the cumulative frequency or percent for each data point or class.
For example 1, we generate the bar chart with the following SPSS procedure.
SPSS:
Analyze →Descriptive Statistics →Frequencies (move the variable into Variables box) → Charts→ Bar charts→ Continue.
With a quick glance at this bar chart, we see that most of the families have 2 or 3 children and fewer families have more than 10 children. The distribution of the data is not symmetrical, since the left side of the bar chart holds more of the data.
With the following procedures, we generate the frequency polygon and the ogive respectively, which show the shape of the data and make it even more understandable.
Frequency Polygon:
Graphs → Chart Builder → Line (drag the simple one into the top (chart preview uses example data box)) → drag the variable into the X-Axis box → Ok
Ogive:
Graphs → Chart Builder → Line (drag the simple one into the top (chart preview uses example data box)) → drag the variable into the X-Axis box → Element Properties (in "Statistic", click on "Histogram" and choose "Cumulative Count") → Apply → Ok
Output:
(Output figures: the frequency polygon and the ogive of the number-of-children data)
The frequency polygon above gives quick information about the shape of the distribution of the data in terms of symmetry and flatness.
The data above are discrete, but for continuous data we can use the same SPSS procedure. However, in order to eliminate the gap between the upper limit of one class and the lower limit of the next class, there is one more step, which you can find in statistics books.
Frequency distribution table (Categorical data)
This is used for nominal or ordinal data such as blood type, sex, or the ranking of universities. In this case we deal with letters or words; therefore, in order to draw information from them and make them more understandable, we label them with numbers in SPSS.
Example 2: Suppose we study 50 patients with heart disease, and the following fictitious data are their blood types. We are not able to interpret and obtain information from these raw data, so we organize them in the table that follows.
AB O B O O B B A A B AB B B O O O O B B O AB A AB B B O O O AB A B B AB O O O B AB B AB O O O A B B AB O B B
We can use SPSS to construct a frequency distribution table. First, in "Variable View", define the variable, which is "Bloodtype". Then, in the "Values" column, assign a numeric value to each level of the variable. For example, we assign the values 1, 2, 3, and 4 to type A, type B, type O, and type AB respectively: in the "Value" box, type "1", and in the "Label" box, type "A", so that SPSS will treat "1" as blood type A. Then, in the "Measure" column, select "Nominal", since the data are nominal. Finally, enter the data in "Data View".
SPSS:
Analyze →Descriptive Statistics →Frequencies (move the variable into Variables box) → Charts→ Bar charts or Pie charts→ Continue.
Output:
Bloodtype | |||||
Frequency | Percent | Valid Percent | Cumulative Percent | ||
Valid | A | 5 | 10.0 | 10.0 | 10.0 |
B | 18 | 36.0 | 36.0 | 46.0 | |
O | 18 | 36.0 | 36.0 | 82.0 | |
AB | 9 | 18.0 | 18.0 | 100.0 | |
Total | 50 | 100.0 | 100.0 |
The frequency table and bar chart above give us more information than the raw data. The table indicates, for example, that 36% of the patients have type B blood and another 36% have type O, so 72% have one of these two types.
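The same frequency table can be produced outside SPSS; the following minimal pandas sketch (our own illustration) counts the blood types directly.

```python
import pandas as pd

blood = ("AB O B O O B B A A B AB B B O O O O B B O AB A AB B B O O O AB A "
         "B B AB O O O B AB B AB O O O A B B AB O B B").split()

freq = pd.Series(blood).value_counts()
print(pd.DataFrame({"Frequency": freq,
                    "Percent": 100 * freq / len(blood)}))
```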
Histogram
Histograms are the most common statistical charts (diagrams) for representing the distribution of a variable. The pattern of a dataset represented by a histogram can be symmetric, skewed right, or skewed left. We represent a continuous dataset with a histogram and a discrete dataset with a bar chart, and we generate these charts in SPSS with the following procedure:
Graphs → Legacy Dialogs
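As an illustration outside SPSS, the following minimal matplotlib sketch (our own example, using the small survey table from earlier) draws both chart types.

```python
import matplotlib.pyplot as plt

weights = [147.3, 112.5, 168, 153, 170.8, 138, 154]  # continuous variable
family_size = [1, 2, 3, 4, 1, 4, 5]                  # discrete variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.hist(weights, bins=5)             # histogram: binned data, adjacent bars
ax1.set_title("Histogram (continuous)")

counts = {v: family_size.count(v) for v in sorted(set(family_size))}
ax2.bar([str(k) for k in counts], list(counts.values()))  # one bar per value
ax2.set_title("Bar chart (discrete)")

plt.show()
```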
Summarizing Data:
So far, we have been dealing with large amounts of data and deriving important information by organizing them in frequency tables and graphs. However, sometimes these graphs or tables are not sufficient to describe the dataset, or we might need a numerical measure to be more precise in our description. Therefore, we try to represent an entire sample, or even a population, by a single number. Indeed, we try to identify a good representative of the dataset which explains the values of the data points, the shape of the distribution, the variability, and the location of the data. A measure that is applied to summarize and describe a characteristic of a population is called a Parameter. Likewise, when it is used to describe a characteristic of a sample, it is called a Statistic.
Measures of Central Tendency
We can summarize the data by concentrating all the information into some specific numbers in order to describe the entire dataset. These numbers are called the measures of central tendency, and they represent the center of the data distribution as well. Indeed, these single numbers are more practical than a frequency table or even a graph, especially when we wish to compare two or more populations in terms of a variable. We are not able to compare the entire data of two populations or samples; however, we can compare their representatives and generalize the answer to the entire populations. That enables us to make decisions faster and at less cost. The Mean (x̄), the Median (MD), and the Mode (MO) are the three main measures of central tendency that can approximately describe a whole dataset or population. Among these three measures, we use the most appropriate one depending on the data.
The Mean
The most important and most frequently used measure of central tendency is the Mean. By definition, the Mean is the average of a dataset, and it is reliable since every data point has influence on this number. Suppose you run a business that makes a different profit each day. If you wish to describe your income for a week, it does not make sense to mention each day's profit. Instead of reporting the seven numbers, you may represent your income during the week by their average. That enables everybody to estimate your weekly income, and it also lets you compare your income from week to week. Consider example 1, which had 50 data points; these were not informative before we constructed the frequency table. If we wish to compare the data of these 50 families to their US counterparts, we cannot compare all the data; instead, we compare their representatives, which are the Means. The Mean for these data is 4.7 ≈ 5 (since the data are discrete, 4.7 children is meaningless). That means the average number of children for these 50 families is 5; put another way, 5 is the Mean of the 50 data points. Data points with higher frequency and higher values have more influence on the Mean. The following formulas are used to calculate the Mean.
Sample Mean: x̄ = (x1 + x2 + x3 + … + xn)/n = Σx/n, where n is the sample size
Population Mean: μ = (X1 + X2 + X3 + … + XN)/N = ΣX/N, where N is the population size
The Population Mean is usually an unknown parameter and is estimated by the sample Mean (a point estimate). The formula above is easy to calculate, and the Mean is a reliable representative of the dataset since it incorporates every value in the data. However, the Mean has some disadvantages. Being affected by all the data is not always good: very large or very small values pull the Mean up or down respectively. That means that if there is an extreme value (an outlier) in our dataset, the Mean is higher or lower than usual. In this case, the Mean is not a proper representative of the data, and we cannot trust it. Therefore, when a dataset has outliers or is skewed, we prefer to use the Median. The Mean is not used for qualitative data.
The Median
The Median is the midpoint (middle value) of the ordered data, denoted by MD. In other words, exactly half of the data values are greater than or equal to the Median, and half are less than or equal to it. Recall example 1 again: the Median for that dataset is 4, which means half of the families have 4 or fewer children and half of the families have 4 or more. This measure of central tendency is not sensitive to outliers, since the values of the data are not all used in the calculation. To calculate the Median, we first sort the data in ascending or descending order. If the number of values in the dataset is odd, the middle value is the Median; if it is even, the Median is the average of the two middle values.
The Mode
The Mode is the most frequently occurring value in the dataset, i.e., the value with the highest frequency. The Mode can be used when the data come from a nominal or an ordinal variable. The Mode is not necessarily unique for a dataset, which is sometimes confusing: a dataset with two Modes is called bimodal, and one with more than two Modes is called multimodal. When all values occur with equal frequency, the dataset has no Mode. The Mode can easily be found in a bar chart or histogram, where the highest bar represents it. We usually use the Mode when we have nominal data, since the Mean and Median are not appropriate.
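A quick way to verify the three measures for the example-1 data is Python's built-in statistics module; this is our own sketch, not an SPSS step.

```python
import statistics

children = [11, 2, 9, 3, 9, 6, 9, 5, 14, 5,
            13, 5, 2, 3, 4, 0, 5, 2, 7, 3,
            4, 0, 5, 4, 3, 2, 4, 2, 6, 4,
            14, 0, 2, 7, 3, 6, 3, 3, 6, 6,
            10, 3, 3, 1, 2, 5, 2, 5, 2, 1]

print(statistics.mean(children))       # 4.7
print(statistics.median(children))     # 4.0 (average of the 25th and 26th sorted values)
print(statistics.multimode(children))  # [2, 3]: two modes, so these data are bimodal
```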
Measures of Variability (dispersion)
For quantitative variables, in addition to measures of central tendency, we must consider measures of dispersion. These measures indicate how much the data are scattered around the measures of central tendency. When we describe a dataset with measures of central tendency, we calculate the measures of variability to indicate how well the measures of central tendency describe the dataset. In other words, these measures identify the reliability of the measures of central tendency.
Suppose a teacher wants to compare the math scores for two different groups of students.
The following data represents the scores for each group:
Group 1: 71, 72, 64, 78, 73, 69, 76, 72, 67, 70 (x̄1 = 71.2)
Group 2: 100, 73, 81, 68, 98, 91, 10, 100, 21, 70 (x̄2 = 71.2)
The Mean for group 1 is the same as the Mean for group 2, although this does not mean that both groups are at the same level of strength in math. For group 1, 71.2 is a proper representative, since almost every value in the dataset is close to the Mean, and to the other values. The variability among the values in group 1 is not high, so the Mean is more reliable. On the other hand, the values in group 2 vary greatly from one data point to another, so the Mean is not a good representative; in particular, there are two weak scores (10 and 21). In conclusion, a dataset with more consistency has more reliable measures of central tendency, and that is the most important reason to calculate the variability of a dataset.
- Range
The simplest measure to calculate is the range, the difference between the largest and smallest values, denoted by R. For the example above, the groups' ranges are R1 = 14 and R2 = 90 respectively. Comparing the two ranges indicates that the variation between the data values in group two is considerably higher than in group one. Besides being simple, this measure can show us whether we have outliers or odd values. However, since only two data values influence it, we look to other measures that take the entire dataset into account.
- Standard Deviation
The most frequently used measure of variability is the standard deviation. This measure shows the spread of the data values more precisely, since it takes the entire dataset into account. The value of the standard deviation depends on the variability of the dataset: the more variability or spread between the data, the higher the standard deviation, and vice versa.
The Sample Standard Deviation: s = √(s²) = √( Σ(x - x̄)² / (n - 1) )
The Population Standard Deviation: σ = √(σ²) = √( Σ(X - μ)² / N )
- Variance
The variance is the squared standard deviation; it is therefore denoted by s² for the sample and σ² for the population. The variance, in fact, is the average of the squared deviations of the data points from the Mean.
- Interquartile Range or IQR
The measures of variability mentioned above are affected by every data point, which can be problematic when there are extreme values in the dataset. The interquartile range is a trimmed range that does not consider the top and bottom 25% of the data distribution. The IQR is the difference between Q3 and Q1 and is mostly used to identify outliers (the quartiles and outliers will be explained later).
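To make these measures concrete, here is a minimal Python sketch (our own illustration) that computes the range, sample standard deviation, variance, and IQR for the two score groups above; note that the quartile rule used by statistics.quantiles may differ slightly from SPSS's, so the IQR can differ a little.

```python
import statistics

g1 = [71, 72, 64, 78, 73, 69, 76, 72, 67, 70]
g2 = [100, 73, 81, 68, 98, 91, 10, 100, 21, 70]

for name, g in (("Group 1", g1), ("Group 2", g2)):
    r = max(g) - min(g)               # range
    s = statistics.stdev(g)           # sample standard deviation (n - 1 denominator)
    var = statistics.variance(g)      # sample variance, s**2
    q1, _, q3 = statistics.quantiles(g, n=4)   # quartiles Q1, Q2, Q3
    print(name, r, round(s, 2), round(var, 2), round(q3 - q1, 2))
```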
SPSS:
We can compute these measures by SPSS.
Analyze→ Descriptive Statistics → Frequencies (move the variables) → Statistics (check central tendency or Dispersion) → Continue
Example 3: The following data indicate the fasting blood glucose levels of 35 randomly selected adults. We want to find the measures of central tendency and variability with SPSS.
94.82 85.06 80.16 90.31 100.99 108.95 85.77 79.96 91.62 86.05 89.20 81.16 84.68 90.42 97.59 79.59 88.81 95.34 90.10 89.17 104.24 101.76 93.55 101 71.97 92.14 106.90 78.52 82.53 84.06 91.65 75.58 107.69 94.28 83.52
First, we define our quantitative continuous variable, Blood Glucose Level, in the "Variable View" window, then enter the data in the "Data View" window, and finally use the procedure above. The following table shows the measures of central tendency and variability.
Output:
Statistics | ||
Blood Glucose Level | ||
N | Valid | 35 |
Missing | 0 | |
Mean | 90.2594 | |
Median | 90.0975 | |
Mode | 71.97a | |
Std. Deviation | 9.33174 | |
Variance | 87.081 | |
Range | 36.98 | |
Minimum | 71.97 | |
Maximum | 108.95 | |
Sum | 3159.08 | |
a. Multiple modes exist. The smallest value is shown |
There are two ways to generate the histogram and display the normal curve for the data.
- Graphs → Legacy Dialogs (choose histogram and move the variable to the right and you can check Display normal curve)
- Graphs → Chart Builder → Ok → select Histogram → double click on the simple histogram (the left one) → drag the variable (Blood Glucose Level) into the X-Axis → check the Display normal curve on the right side → Apply → Ok
Output: (histogram of the blood glucose levels with the normal curve overlaid)
Measures of Shape
- Skewness
As mentioned, statistical graphs and charts enable us to distinguish the pattern or shape of a distribution. Consider the following histograms of three different types of data distribution.
The shape of the data distribution in "Figure 3" is almost symmetric, as opposed to "Figures 1" and "2". In this figure, the Mean, Median, and Mode are approximately equal, or close. However, in Figures 1 and 2 we see skewness to the right or to the left, and the measures of central tendency are not close. In "Figure 1", the Mode is the smallest value, since the low values have the highest frequency; the Mean is dragged toward the right tail, and the Median lies between them. It is the opposite in "Figure 2": the Mode is higher than the Mean and Median, and the Mean is the smallest measure, dragged toward the left tail. Therefore, for skewed data, the Median is the best representative.
Frequency Distribution Shapes
There are three important shapes which are positively skewed, symmetric, and negatively skewed.
- Positively Skewed (Right Skewed) Distribution (skewness is positive)
In this case, the Mean is the largest measure, followed by the Median and then the Mode. That means the majority of the data values are smaller than the Mean. In this shape, the tail extends to the right (Figure 1).
- Symmetric (skewness is zero)
In this distribution, the three measures of central tendency have almost the same values. That means the data values are distributed evenly on both sides of the Mean (Figure 3).
- Negatively Skewed (Left Skewed) Distribution (skewness is negative)
This is the opposite of the positively skewed distribution. The Mean is the smallest measure, and the Median and the Mode are respectively greater than it. That means the majority of the data values are greater than the Mean. In this shape, the tail extends to the left (Figure 2).
Skewness is a numeric measure of the asymmetry of a data distribution about its Mean. It is calculated by statistical software such as SPSS and can be interpreted as follows:
- The distribution is highly skewed when the value of skewness is less than “-1” or greater than “1”.
- The distribution is moderately skewed when the value of skewness is between "-1" and "-1/2" or between "+1/2" and "+1".
- The distribution is approximately symmetric when the value of skewness is between “-1/2” and “+1/2”.
The skewness calculated by SPSS is the sample skewness, which indicates the skewness of the sample, not necessarily of the population. We cannot always generalize the result to the population, as a skewed sample might come from a symmetric population or vice versa. (4)
- Kurtosis
Kurtosis is a measure of the shape of a dataset which describes the peakedness or flatness of the distribution. The curve can be normal, flat, or peaked. The distribution of the data is called:
- Mesokurtic when the value of kurtosis is "0" and the curve is normal (the bell curve). (SPSS reports excess kurtosis, i.e., kurtosis minus 3, so "0" is the normal reference value.)
- Platykurtic when the value of kurtosis is less than "0"; the curve has shorter and thinner tails and a lower and broader peak (see the following diagrams).
- Leptokurtic when the value of kurtosis is greater than "0"; the curve has longer and fatter tails and a higher and sharper peak.
We can compute the skewness and kurtosis of the dataset with the following procedure in SPSS:
Analyze → Descriptive Statistics → Frequencies (move the variable to the right) → Statistics (check skewness and Kurtosis) → Continue → Ok
As the following histogram from example 3 shows, the data are approximately symmetric.
Statistics | ||
Blood Glucose Level | ||
N | Valid | 35 |
Missing | 0 | |
Skewness | .284 | |
Std. Error of Skewness | .398 | |
Kurtosis | -.478 | |
Std. Error of Kurtosis | .778 |
The table above indicates that the skewness is between -0.5 and 0.5, which means the distribution of the sample (not the population) is approximately symmetric. The kurtosis is -0.478, which is less than "0" and means the distribution of the sample is platykurtic. However, to check whether these values are statistically significant, we use a statistical hypothesis test, which we will explain later.
Measures of Position
These measures indicate the location of any data value in relation to the others when we arrange the data in ascending order. In other words, they indicate where a specific data value lies in a data distribution and explain the relative position of an individual data point within the distribution (rank-ordered data).
- Percentiles and percentile rank
These are frequently used by medical researchers, teachers, and economists in order to compare a particular individual to the other observations. Many standardized exams, such as the MCAT (Medical College Admission Test), release a percentile rank for each examinee along with the numeric score. Pediatricians express children's growth using percentiles and percentile ranks, and monitor growth with charts (e.g., the WHO (World Health Organization) child growth standards) to determine whether a child is above, below, or on the normal curve. Percentiles and percentile ranks are applied because a raw value is not completely meaningful by itself. Suppose a student scored 80 on a test. Knowing the score of 80 is not sufficient to conclude how well the student performed compared to the rest of the test takers. However, if we know this student performed as well as or better than only 40% of the other examinees, this suggests that the exam was not difficult, since 60% of the students performed better than the student who scored 80. The raw score of 80 was therefore misleading, and the percentage of 40 gave us a better idea of the student's performance relative to the other students.
The Kth percentile is the value at or below which K percent of the data in a distribution lie. A percentile rank, on the other hand, indicates the percentage of the data in the distribution that fall at or below a specified value.
Therefore, the student's score is at the 40th percentile, or the percentile rank of the score 80 is 40. That means 40% of the students received a score of 80 or less. The following example clarifies the definition.
According to “2016 Household Income Percentile Calculator for the United States” in the ‘DQYDJ’ website, “A household making $175,000.00 annually was percentile 91.6% in 2016. This percentile ranged from $174,565.00 to $175,052.00 a year.” That means that almost 92% of households made the same or less than $175,000.00, or the value of $175,000.00 is at the 91.6th percentile.
Hence, the Median is the 50th percentile: half of the values are smaller than or equal to it.
As we have defined, a percentile indicates a value, and a percentile rank refers to the corresponding percentage. Therefore, percentiles can be any continuous (real) numbers, while percentile ranks range from 0 to 100. Percentiles are usually denoted by Pi, where the value of Pi indicates the percentile and "i" indicates the percentage or percentile rank; percentile ranks are denoted by PR. For example, if a standardized exam releases an examinee's score as P80 = 90, it means the examinee scored 90 and 80% of the examinees received a score of 90 or less. The percentile is 90 and the percentile rank is 80; the examinee is at the 80th percentile.
For example 3, we wish to know the 75th and 95th percentiles. The following SPSS procedure is used to calculate the percentiles of interest.
Analyze → Descriptive Statistics → Frequencies (move the variable to the right box) → Statistics (check the "Percentiles" box, type each percentile you wish to calculate, and click "Add") → Continue → Ok
Output:
Statistics | ||
Blood Glucose Level | ||
N | Valid | 35 |
Missing | 0 | |
Percentiles | 75 | 95.3398 |
95 | 107.9441 |
The output above indicates P75 = 95.3398 and P95 = 107.9441. That means 75% of the individuals have a blood glucose level of 95.3398 or less. Likewise, 95% of the individuals have a blood glucose level of 107.9441 or less.
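The same percentiles can be computed with numpy; in this sketch (our own, requiring numpy 1.22 or newer for the method argument) the "weibull" method uses the (n + 1)p position, which appears to match SPSS's weighted-average definition. Because the data listed above are rounded to two decimals, the results only approximate the SPSS output.

```python
import numpy as np

glucose = [94.82, 85.06, 80.16, 90.31, 100.99, 108.95, 85.77, 79.96, 91.62,
           86.05, 89.20, 81.16, 84.68, 90.42, 97.59, 79.59, 88.81, 95.34,
           90.10, 89.17, 104.24, 101.76, 93.55, 101, 71.97, 92.14, 106.90,
           78.52, 82.53, 84.06, 91.65, 75.58, 107.69, 94.28, 83.52]

p75, p95 = np.percentile(glucose, [75, 95], method="weibull")
print(p75, p95)  # about 95.34 and 107.94
```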
If you wish to calculate the percentile ranks for each data point, the procedures are as follows:
- Data → Sort Cases (move the variable into the “Sort by” box) → ok
- Transform → Rank Cases (move the variable into the "Variables" box) → Rank Types (uncheck "Rank" and check "Fractional rank as %") → Continue → Ties (check "High", per the definition of percentile rank above) → Continue → Ok
Percentile ranks appear in a new column in “Data View” corresponding to each data point.
Alternatively, the following procedure, which uses the formula PR = 100 × (R - 0.5)/n for discrete data, is more precise:
- Data → Sort Cases (move the variable into the “Sort by” box) → ok
- Transform → Rank Cases (move the variable into the "Variables" box) → Rank Types (check "Rank") → Continue → Ties (check "Mean") → Continue → Ok
- Transform → Compute Variable (1. choose a name in "Target Variable", e.g., "PercentileRank"; 2. in "Numeric Expression", type (Rvariable - 0.5)*100/n, where Rvariable is the rank variable created in the previous step and n is the sample size) → Ok
Percentile ranks appear in a new column in “Data View” corresponding to each data point.
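The PR = 100 × (R - 0.5)/n formula is easy to mirror in pandas; the sketch below is our own illustration on hypothetical scores, with ties receiving their mean rank exactly as in the SPSS procedure above.

```python
import pandas as pd

scores = pd.Series([67, 72, 72, 80, 85, 90, 94])  # hypothetical data

ranks = scores.rank(method="average")             # mean rank for tied values
pr = 100 * (ranks - 0.5) / len(scores)            # PR = 100 * (R - 0.5) / n
print(pd.DataFrame({"score": scores, "rank": ranks, "percentile_rank": pr}))
```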
- Deciles
The deciles divide the frequency distribution into 10 equal groups. They are similar to the percentiles: the first decile is the 10th percentile, which means 10% of the values are equal to or smaller than the first decile. The deciles are denoted by D1, D2, D3, …, D9 and correspond to P10, P20, P30, …, P90. Since SPSS does not calculate deciles directly, we calculate the corresponding percentiles (P10, P20, P30, …, P90) instead.
- Quartiles
Q1, Q2, and Q3, the first, second, and third quartiles, divide the distribution into four equal groups (quarters). Each group consists of 25% of the dataset. The first quartile (Q1) is the same as the 25th percentile; the second quartile (Q2) is the same as the 50th percentile, the Median, or the fifth decile; and the third quartile (Q3) is the same as the 75th percentile. Recall the interquartile range: the following shows its formula, the IQR of the blood glucose data from example 3, and the SPSS procedures.
IQR = Q3-Q1
IQR=95.3398-83.5247=11.8151
SPSS:
Analyze → Descriptive Statistics → Frequencies → Statistics (check “Quartiles” box)
Analyze → Descriptive Statistics → Explore (move the variable into the “Dependent List” box) → Statistics (check “Outliers”) → Continue → ok
We run the procedures above for example 3:
Output:
Statistics | ||
Blood Glucose Level | ||
N | Valid | 35 |
Missing | 0 | |
Percentiles | 25 | 83.5247 |
50 | 90.0975 | |
75 | 95.3398 |
Descriptives | ||||
Statistic | Std. Error | |||
Blood Glucose Level | Mean | 90.2594 | 1.57735 | |
95% Confidence Interval for Mean | Lower Bound | 87.0538 | ||
Upper Bound | 93.4649 | |||
5% Trimmed Mean | 90.1813 | |||
Median | 90.0975 | |||
Variance | 87.081 | |||
Std. Deviation | 9.33174 | |||
Minimum | 71.97 | |||
Maximum | 108.95 | |||
Range | 36.98 | |||
Interquartile Range | 11.82 | |||
Skewness | .284 | .398 | ||
Kurtosis | -.478 | .778 |
Outliers
Outliers are data values that are extreme, either high or low, compared to the other data values. There are two types of outliers: mild and extreme. A data value is considered an extreme outlier when it is outside the interval (Q1 - 3·IQR, Q3 + 3·IQR), whereas it is a mild outlier when it is outside the interval (Q1 - 1.5·IQR, Q3 + 1.5·IQR) but not extreme.
A boxplot also represents the quartiles and the outliers. In SPSS, mild outliers and extreme outliers are represented by circles and stars respectively. It is worth mentioning that we should not drop outliers without an appropriate reason, even though they can affect the accuracy of our results. If the data are normally distributed, we can use the Z-score, which we will explain later, to identify outliers: a data value whose Z-score is greater than 3 or less than -3 is labelled an outlier.
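The fences for mild and extreme outliers are straightforward to compute; the following minimal sketch (our own, using the example-3 data and numpy's "weibull" percentile method as above) confirms that this dataset has no outliers.

```python
import numpy as np

glucose = [94.82, 85.06, 80.16, 90.31, 100.99, 108.95, 85.77, 79.96, 91.62,
           86.05, 89.20, 81.16, 84.68, 90.42, 97.59, 79.59, 88.81, 95.34,
           90.10, 89.17, 104.24, 101.76, 93.55, 101, 71.97, 92.14, 106.90,
           78.52, 82.53, 84.06, 91.65, 75.58, 107.69, 94.28, 83.52]

q1, q3 = np.percentile(glucose, [25, 75], method="weibull")
iqr = q3 - q1

mild = [x for x in glucose if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
extreme = [x for x in glucose if x < q1 - 3 * iqr or x > q3 + 3 * iqr]
print(round(iqr, 2), mild, extreme)  # ~11.82, [], []: no outliers
```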
Box Plot
The box plot is a simple and useful statistical graph which displays important information about a dataset, such as the pattern or shape of the data. It shows the quartiles (Q1, Q2, Q3) as well as the maximum and minimum values, marked by the upper whisker and lower whisker respectively. The IQR, the range, and any outliers are also shown on the diagram: the Median (M = Q2) is shown by the middle line, and R (the range) is marked at the left of the diagram. We use box plots for quantitative variables. Consider the following box plots and their important parts.
As you can see, a box plot enables us to identify skewness of the data distribution as well.
Depending on whether the line representing the Median is closer to the upper line, the centre, or the lower line of the box, the data distribution is left skewed, symmetric, or right skewed respectively. The box plot above and the following box plots represent different patterns.
The following procedures in SPSS generate the box plot of the data.
SPSS:
Graphs → Legacy Dialogs → Boxplot → Simple (check "Summaries of separate variables") or
Analyze → Descriptive Statistics → Explore → First, move the dependent variables into the Dependent List box and then move the independent or group variable (if there is any) into the “Factor List” box → Statistics → check Outliers → Continue → ok
We run the procedures above for example 3; the result shows that the data distribution is almost symmetric, with no outliers:
- Standard Scores (Z-Score)
The Z-score is a standardized score that indicates the position of a data value relative to the center (Mean) of the distribution. Usually the value of a data point alone is not sufficiently informative, as it does not convey the relative position of the data point. However, the corresponding Z-score enables us to describe the location of the data point in relation to the other data in the dataset. For example, a Z-score of "1" indicates that the corresponding value is one standard deviation greater than the Mean, and a Z-score of "-1" indicates that it is one standard deviation less than the Mean. Therefore, the sign of the Z-score indicates whether the data value is greater or less than the Mean, and its magnitude indicates the distance of the data point from the Mean, using the standard deviation as the unit. We can also apply standard scores to compare two data values from two different datasets and distributions.
For example, suppose we have two separate groups of males and females who have heart disease. If we wish to compare a female's serum cholesterol level with a male's, the two data values should first be standardized and then compared. That means we take the other data values of the corresponding population, and its variation, into account. As mentioned in the explanation about outliers, standard scores can be applied to determine outliers when the data are normally distributed. The following formulas are used to calculate Z-scores when we work with the entire population and with a sample respectively:
z = (x - μ)/σ , z = (x - x̄)/s
where z is the Z-score, x is the data value, μ is the population Mean, σ is the population standard deviation, and x̄ and s are the sample Mean and sample standard deviation respectively.
Suppose a student wishes to compare his math score with his French score. If his score in math is 67 and in French is 69, he has to standardize the two scores before comparing them, since each score comes from a different distribution. In other words, we are not supposed to compare raw data from different datasets without considering the variation of the data; we cannot say the student did better in French just because 69 > 67. Suppose the Mean and standard deviation of the math class are 69 and 10 respectively, and the Mean and standard deviation of the French class are 75 and 11 respectively. The following are the Z-scores of the math and French scores.
Math: z = (67 - 69)/10 = -0.2 and French: z = (69 - 75)/11 ≈ -0.55
Comparing the two Z-scores (-0.2 > -0.55), we can see that the student did better in math once the other students' scores are taken into account.
The Z-score is also used to identify outliers when the distribution is approximately normal: a standardized data value less than -2.68 or greater than 2.68 is labeled an outlier. The following procedure in SPSS generates the Z-scores of the data values.
Analyze → Descriptive Statistics → Descriptives (move the variable into the "Variable" box and check "Save standardized values as variables") → Ok
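To see the standardization at work, here is a minimal Python sketch (our own illustration) of the math/French comparison, followed by standardizing a whole sample as SPSS's "Save standardized values as variables" option does.

```python
import statistics

def z_score(x, mean, sd):
    """Distance of x from the mean, in standard-deviation units."""
    return (x - mean) / sd

print(z_score(67, 69, 10))            # math:   -0.2
print(round(z_score(69, 75, 11), 2))  # French: -0.55

# Standardize every value of a sample (here the group-1 scores from earlier).
data = [71, 72, 64, 78, 73, 69, 76, 72, 67, 70]
m, s = statistics.mean(data), statistics.stdev(data)
z = [(x - m) / s for x in data]       # one z-score per data point
print([round(v, 2) for v in z])
```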
The Normal Distribution (bell curve)
The most important and most frequently used continuous statistical distribution is the normal distribution, which underlies the assumptions of many statistical tests. In addition, many human characteristics, such as weight, height, and blood pressure, are approximately normally distributed. Medical scientists study the normal ranges of these characteristics so that physicians can decide whether patients whose values fall outside these ranges need further investigation. Furthermore, the normal distribution enables us to calculate the probability of observing values less than or greater than a particular value, or between two specific values.
Suppose the following fictitious data indicate the total serum cholesterol levels (mg/dl) of 15 males.
146.11, 123.08, 265.27, 250.37, 291.5, 197.36, 160.56, 224.86, 240.68, 253.66, 224.28, 213.15, 297.7, 260.2, 223.78
First, we generate a histogram (Figure 1) to determine the shape of data as well as measures of central tendency.
Output:
Statistics | ||
Serum Cholesterol Level | ||
N | Valid | 15 |
Missing | 0 | |
Mean | 224.8373 | |
Median | 224.8643 | |
Mode | 123.08a | |
Std. Deviation | 50.67347 | |
a. Multiple modes exist. The smallest value is shown |
The histogram is not symmetric and indicates that the distribution of the sample is left skewed. When we increase the sample size to 50 people, the following histogram (Figure 2) is generated.
The histogram now appears approximately symmetric, and the measures of central tendency are close. You can see that most of the data are clustered around the Mean. We then increase the sample size to 500 and generate the histogram (Figure 3) and the normal curve (Figure 4).
Figure 1 represents the distribution of the data, which is left skewed; as the sample size increases, the distribution gradually looks more symmetrical, or bell-shaped. The curve presented in Figure 4 is called the Normal Curve. The more closely the curve covers the bars, the more nearly the data are normally distributed. Since the normal curve is supposed to cover 100% of the data, the total area under the curve is equal to 1 (100%). We can continue increasing the sample size to 1000 or even more to approximate the distribution of the population. By increasing the sample size, the shape of the data distribution becomes more symmetrical with respect to the vertical line passing through the Mean (which almost equals the Median and the Mode). Although many natural variables approximately follow the normal distribution, it is worth mentioning that the normal distribution is a theoretical distribution, and an exact normal curve can never be reached in practice. Therefore, when a dataset is considered approximately normal, we are able to use the properties of the normal distribution.
In every normal distribution, we have the following rules (known as the empirical rules), which the sketch after this list verifies numerically:
- About 68% of the data fall within 1 standard deviation of the Mean.
- About 95% of the data fall within 2 standard deviations of the Mean.
- About 99.7% of the data fall within 3 standard deviations of the Mean.
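These percentages come from the areas under the normal curve; the following scipy sketch (our own check, assuming scipy is installed) reproduces them.

```python
from scipy.stats import norm

# Area within k standard deviations of the mean for a normal distribution.
for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd: {area:.4f}")
# within 1 sd: 0.6827, within 2 sd: 0.9545, within 3 sd: 0.9973
```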
For example, the following figure, "Figure 5", represents the distribution of the data around the Mean when the data are normally distributed with Mean "μ" and standard deviation "σ".
When the Mean is "0" and the standard deviation is "1" (N(0,1)), the distribution is called the Standard Normal Distribution. The following figure represents the standard normal curve.
The normal curve depends on two parameters, the Mean and the standard deviation. The Mean indicates the center of the curve, which coincides with the Median and the Mode, and the standard deviation indicates the spread or width of the curve. Increasing or decreasing the Mean shifts the curve to the right or left respectively. On the other hand, increasing the standard deviation makes the curve flatter and wider, while decreasing it makes the curve taller and narrower. Therefore, a normal distribution is defined by two important parameters, the Mean and the standard deviation, denoted by "μ" and "σ" respectively and written N(μ, σ). The following figures clearly explain the roles of the parameters in the formation of the curve.
In Figure 1, the blue curve is N (5,2) and the orange curve is N (10,2).
In Figure 2, the blue curve is N (5,2) and the orange curve is N (5,8).
As mentioned earlier, the normal distribution enables us to calculate the probability, or the proportion, of the data that fall below or above a specified value, or between two values. We do this by measuring the corresponding area under the normal curve. We standardize the normal distribution and use the standard normal probability table, since every normal variable X with Mean μ and standard deviation σ can be converted to a variable Z with the standard normal distribution.
If X ~ N(μ, σ), then Z = (X - μ)/σ ~ N(0, 1).
Statistical software or a calculator enables us to solve these problems quickly.
To calculate the probability, we can use any online normal distribution calculator such as:
“http://onlinestatbook.com/2/calculators/normal_dist.html” or SPSS.
Suppose the students' scores in a math class are normally distributed with a Mean of 74 and a standard deviation of 6, i.e., N(74, 6), and we wish to know:
- The probability of students who scored less than 65 (p(X<65))?
- The probability of students who scored between 64 and 68 (p(64<X<68))?
- The probability of students who scored 77 or greater (p(X≥77))?
- The value corresponding to the 80th percentile of this distribution.
The following procedure in SPSS is used for the first three questions:
- Define three variables "Mean", "SD", and "X" in the "Variable View" window.
- In “Data View” enter the value of the variables in this example (74, 6, 65) for question 1.
- Transform → Compute Variable → in “Target Variable” box, choose a new name for example “Probability 1” → in “Function Group”, choose “CDF & Noncentral CDF” → in “Function and Special Variables”, choose “Cdf. Normal” and double click → fill the “Numeric Expression” box with the values of X, Mean, and standard deviation respectively → Ok
SPSS creates a new column named “Probability 1” with the value in “Data View”.
Solutions:
- p(X<65) ≈ 0.066, which means there is a 6.6% chance (the yellow area in the figure) that a student scored less than 65.
- We use the same procedure above to find p(X<68) and p(X<64) then subtract the values.
p(64<X<68) = p(X<68) - p(X<64) = 0.158 - 0.047 = 0.111 = 11.1%
- Since the procedure above calculates the probability of being less than a value, for the third question (the blue area) we first calculate p(X<77) and then subtract it from 1, since the total area under the normal curve is 1: p(X≥77) = 1 - p(X<77) = 1 - 0.691 = 0.309 = 30.9%
- We are looking for the value below which 80 percent of the data lie.
p(X<a) = 0.8 → a ≈ 79.05. This means 80 percent of the students scored less than 79.05.
For question 4, we use SPSS (inverse normal distribution) with the following procedure since this question is the inverse of the questions above.
- Define three variables Mean, SD, and P in the variable view
- In "Data View", enter the values of the variables, which in this example are (74, 6, 0.8) for question 4.
- Transform → Compute Variable → in “Target Variable” box, choose a new name for example “X” → in “Function Group”, choose “Inverse DF” → in “Function and Special Variables”, choose “Idf. Normal” → fill the “Numeric Expression” box with the values of P, Mean, and standard deviation respectively → Ok
SPSS creates a new column named “X” with the value in data view.
We can also use any online inverse normal distribution calculator such as:
“http://onlinestatbook.com/2/calculators/inverse_normal_dist.html”.
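All four questions can also be answered with scipy's normal distribution functions; the sketch below (our own illustration, not the SPSS route) uses norm.cdf in place of Cdf.Normal and norm.ppf in place of Idf.Normal.

```python
from scipy.stats import norm

mu, sd = 74, 6  # N(74, 6)

p1 = norm.cdf(65, mu, sd)                         # P(X < 65)       ~ 0.066
p2 = norm.cdf(68, mu, sd) - norm.cdf(64, mu, sd)  # P(64 < X < 68)  ~ 0.111
p3 = 1 - norm.cdf(77, mu, sd)                     # P(X >= 77)      ~ 0.309
a = norm.ppf(0.8, mu, sd)                         # 80th percentile ~ 79.05
print(p1, p2, p3, a)
```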
As mentioned above, the probability between two values, say a and b, is the same as the area under the curve between the two vertical lines that cross the X-axis at "a" and "b". Therefore, the probability of any exact value is always "0", since a single vertical line has no area under the curve.
p(X = a) = 0 when X ~ N(μ, σ)
Many parametric statistical methods that will be explained later require the data to be approximately normally distributed. However, as mentioned earlier, the normal distribution is a continuous distribution, and ordinal or nominal data cannot follow a normal distribution. According to "Biomeasurement: A Student's Guide to Biostatistics" by Dawn Hawkins, discrete data with a symmetrical bell-shaped frequency distribution can be considered approximately normally distributed in practice.
The following numerical and visual outputs should be checked when testing for normality:
- Skewness and kurtosis Z-values should be between -1.96 and +1.96.
As mentioned earlier, a dataset with skewness and kurtosis of "0" is perfectly symmetric. The values of skewness and kurtosis obtained from SPSS are calculated from a sample, not a population, and cannot by themselves be interpreted for the population. Therefore, we calculate the corresponding test statistics (Z-values) for skewness and kurtosis by dividing each value by its standard error:
Z_Skewness = Skewness / Standard Error of Skewness , Z_Kurtosis = Kurtosis / Standard Error of Kurtosis
The Z-values above should be between -1.96 and +1.96; otherwise the distribution of the population very likely has skewness or kurtosis, depending on which Z-value is out of range. We still need to run a normality test and check the following graphs to assess the normality of the population distribution.
- The Shapiro-Wilk test p-value should be above 0.05.
There are several tests for checking the normality of a distribution. SPSS offers two: the Shapiro-Wilk test and the Kolmogorov-Smirnov test.
- Histogram, Normal Q-Q plots (or P-P plots), and Box plots.
We illustrate the above conditions with the data from example 3 to check the normality of the population.
The following procedure is used to obtain skewness, kurtosis, and corresponding standard errors.
SPSS:
Analyze → Descriptive Statistics → Frequencies (move the variable to the right) → Statistics (check skewness and kurtosis) → Continue → Ok
Output:
Statistics | ||
Blood Glucose Level | ||
N | Valid | 35 |
Missing | 0 | |
Skewness | .284 | |
Std. Error of Skewness | .398 | |
Kurtosis | -.478 | |
Std. Error of Kurtosis | .778 |
The Z-values for skewness and kurtosis are z_skewness = 0.284/0.398 ≈ 0.714 and z_kurtosis = -0.478/0.778 ≈ -0.614 respectively. Since both Z-values are between -1.96 and +1.96, we can run the normality test with the following procedure, which generates the graphs as well:
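The same Z-values can be computed outside SPSS. In the sketch below (our own illustration), bias=False asks scipy for the sample (G1/G2) estimators that SPSS reports, and the standard-error formulas are the usual ones for sample skewness and kurtosis; because the listed data are rounded, the results only approximate the SPSS values.

```python
import math
from scipy.stats import skew, kurtosis

glucose = [94.82, 85.06, 80.16, 90.31, 100.99, 108.95, 85.77, 79.96, 91.62,
           86.05, 89.20, 81.16, 84.68, 90.42, 97.59, 79.59, 88.81, 95.34,
           90.10, 89.17, 104.24, 101.76, 93.55, 101, 71.97, 92.14, 106.90,
           78.52, 82.53, 84.06, 91.65, 75.58, 107.69, 94.28, 83.52]

n = len(glucose)
sk = skew(glucose, bias=False)        # sample skewness
ku = kurtosis(glucose, bias=False)    # sample excess kurtosis (normal = 0)

se_sk = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_ku = 2 * se_sk * math.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))

print(sk / se_sk, ku / se_ku)  # both should fall between -1.96 and +1.96
```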
Analyze → Descriptive Statistics → Explore → Plots (check "Normality plots with tests", which includes the Shapiro-Wilk test, and "Histogram") → Continue → Ok
Output:
Tests of Normality | ||||||
Kolmogorov-Smirnova | Shapiro-Wilk | |||||
Statistic | df | Sig. | Statistic | df | Sig. | |
Blood Glucose Level | .077 | 35 | .200* | .976 | 35 | .639 |
*. This is a lower bound of the true significance. | ||||||
a. Lilliefors Significance Correction |
The p-value of the Shapiro-Wilk test (0.639 > 0.05) is not statistically significant. That means we do not reject the null hypothesis, which says that the data are normally distributed. (We will explain hypotheses and p-values later.)
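The Shapiro-Wilk test is also available in scipy; this minimal sketch (our own check) should give a p-value close to the SPSS output, up to the rounding of the listed data.

```python
from scipy.stats import shapiro

glucose = [94.82, 85.06, 80.16, 90.31, 100.99, 108.95, 85.77, 79.96, 91.62,
           86.05, 89.20, 81.16, 84.68, 90.42, 97.59, 79.59, 88.81, 95.34,
           90.10, 89.17, 104.24, 101.76, 93.55, 101, 71.97, 92.14, 106.90,
           78.52, 82.53, 84.06, 91.65, 75.58, 107.69, 94.28, 83.52]

stat, p = shapiro(glucose)
print(stat, p)  # p > 0.05: do not reject normality
```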
"Figure 1" represents a normal Q-Q plot, which indicates that most of the data points lie on or close to the line. In the ideal case, all the data points fall on the line.
"Figure 2" represents the histogram, which shows the data are almost symmetric, with slight positive skewness and flatness.
Finally, "Figure 3" represents the boxplot, which confirms the shape of the data explained by the histogram.
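If you want to draw the normal Q-Q plot outside SPSS, scipy's probplot produces an equivalent graph; the sketch below is our own illustration.

```python
import matplotlib.pyplot as plt
from scipy import stats

glucose = [94.82, 85.06, 80.16, 90.31, 100.99, 108.95, 85.77, 79.96, 91.62,
           86.05, 89.20, 81.16, 84.68, 90.42, 97.59, 79.59, 88.81, 95.34,
           90.10, 89.17, 104.24, 101.76, 93.55, 101, 71.97, 92.14, 106.90,
           78.52, 82.53, 84.06, 91.65, 75.58, 107.69, 94.28, 83.52]

# Sample quantiles vs. theoretical normal quantiles, with a fitted line.
stats.probplot(glucose, dist="norm", plot=plt)
plt.title("Normal Q-Q plot of blood glucose level")
plt.show()
```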