Learning Objectives #
By the end of this chapter, fellows will have a clear understanding of:
- History of the scientific method
- Experimental design
- Statistical terms
- Variables
Introduction #
Research in simulation is no different from any other kind of research. Research related to simulation focuses mostly on simulation technology, the testing of simulation methodology, and the teaching methods and techniques used to improve the knowledge and experience of students or physicians in the healthcare system. Simulation research may therefore cover the spectrum of research activities described by Kirkpatrick’s classification. This may include, but is not limited to, observation, validity, reliability, efficacy, cost effectiveness, and the evaluation of healthcare outcomes that may be affected by educational processes. It is therefore important to understand the basic principles of research, such as the scientific method and experimental design, particularly as they pertain to patient simulation in health education.
Historical Notes #
Since the beginning of time human beings have been curious about the nature of things and how we gain knowledge. Today, psychologists classify knowledge into three categories: Personal knowledge, procedural knowledge and propositional knowledge.
Personal Knowledge –
Knowledge claimed by an individual, such as in statements like “I know where the doctor’s office is.” Personal knowledge relates to specific information that an individual has gained on their own and personally believes in.
Procedural Knowledge –
Knowledge of how to do things, such as swimming or riding a bike. In this case, the individual is claiming that they possess certain skills without necessarily knowing anything about the theories of thermodynamics or the physics of water and air in relationship to the body, etc.
Propositional Knowledge –
Essentially this kind of knowledge can be considered justified knowledge, or the knowledge of truth and facts, such as that the heart has four chambers.
The first two types of knowledge are simple to understand and do not create controversies. However, propositional knowledge requires justifications that must be based on facts or evidence that are the subject of scientific methods.
Throughout time, human beings have subscribed to several schools of thought and philosophy in relation to propositional knowledge. The Ancient Greek schools of philosophy, such as the Hellenistic schools, promoted the idea of skepticism for a long time. Many philosophers, such as Gorgias (c. 487–376 B.C.), Pyrrho of Elis (c. 360–270 B.C.), Socrates and others, believed that since we cannot ever actually reach the truth, we should not claim to know the truth. The famous claim of Socrates, “I know one and only one thing, that I know nothing”, illustrates the extent of skepticism in this period. Pyrrhonian skepticism was described in a book called “Outlines of Pyrrhonism” by the Greek physician Sextus Empiricus (c. 200 A.D.). Empiricus incorporated aspects of Empiricism into Pyrrhonian skepticism, stating that the origin of all knowledge is experience.
During the Dark Ages, trends of rational thinking were largely suppressed by religious dogmatism. Around the 16th century, in the period of reason and enlightenment, the idea of skepticism flourished again through the work of Michel de Montaigne (1533–1592) in France and Francis Bacon in England. At this time it was René Descartes (1596–1650) who concluded that the very act of thinking proved his own existence beyond doubt. This famous phrase, “Cogito Ergo Sum”, provided a foundation for propositional knowledge and the development of the scientific method. It was also during this period that philosophical divisions such as epistemology, ontology and theology emerged.
In the context of knowledge justification, human beings have always been curious about natural phenomena and how they occur: what causes natural occurrences, and how can we control them? In the quest for answers, people slowly developed the modern idea of the “scientific method”. Ancient civilizations such as the Egyptians used empirical methods in astronomy, mathematics and medicine. The Greeks also contributed to the advancement of the scientific method, particularly through the work of Aristotle. During the Dark Ages, the work of Islamic scholars such as Ibn Sina (980–1037), Al-Biruni (973–1048), and others also contributed to the development of the scientific method. However, the Islamic scholar Alhazen, or Ibn al-Haytham (965–1040), performed optical and physical experimentation and was the first to explicitly state that a controlled environment and specific measurements are required in order to draw a valid conclusion from an experiment. For this reason, some call him the father of the scientific method. Today the scientific method is defined by the Oxford English Dictionary as a method or procedure that has characterized natural science since the 17th century, consisting of systematic observation, measurement and experiment, and the formulation, testing and modification of hypotheses. Scientific knowledge is reliable propositional knowledge that reflects the truth under specific circumstances and conditions. This knowledge is based on empirical fact, which is open to challenge.
Exploration of cause-event relationships and identification of the functional variables affecting an event led scientists in two important directions: first, toward the verification of scientific theories with a generalizable character regarding the occurrence of events; second, toward assuming control over the phenomena of interest. That is why it can be claimed that the scientific method today has both epistemological and ontological characteristics.
Epistemology, a term introduced by James Frederick Ferrier (1808–1864) from the Greek “episteme”, meaning knowledge and understanding, and “logos”, the study of, is the branch of philosophy concerned with the nature and scope of knowledge, also referred to as the “theory of knowledge”. Epistemology is knowledge about knowledge; it questions what knowledge is and how it can be acquired. In epistemology two types of knowledge are identified: 1) propositional knowledge, the knowledge of what, and 2) acquaintance knowledge, the knowledge of how.
Ontology—The philosophical study of being, existence or reality. It is a branch of metaphysics that deals with questions concerning what entities exist or can be said to exist and how these entities can be classified according to their similarities or differences.
The scientific method was used in applied science long before investigators such as James in 1890, Thorndike in 1903 and others applied it in the fields of psychology and education to observe cause-effect relationships between teaching and learning variables. Under this paradigm of knowledge, the scientific method led to the creation of educational theories such as Behaviorism (Pavlov; Watson 1925; Skinner 1938), Cognitivism (Neisser 1967 and others) and Constructivism (Piaget, Vygotsky). The natural progression for investigators such as Norman, Brooks, Schmidt and others using the scientific method in psychology and education during this time was towards health education.
Today, the scientific method is used not only for the examination and testing of theories and philosophical concepts, but also in applied fields such as education and psychology. Using the scientific method, we can test whether innovative approaches to medical education work; whether approaches proven effective for one group of learners or one area of knowledge apply to others; how experts perform across various educational modalities; and so on. Reporting on the scientific method requires proof of its validity; therefore experiments are repeated to demonstrate that results are real and valid.
Based on the mathematical approach to probability theory developed by Blaise Pascal (1623–1662) and Pierre Fermat (1601–1665), investigators began testing the validity of the scientific method. Jules Gavarret (1890) and Venn (1888) were perhaps the first to use terms such as “test” and “significance” in historical research. In the 1900s, Karl Pearson developed the chi-squared test and W.S. Gosset developed the t-distribution test known as the “Student t-test”. The modern approach to hypothesis testing in scientific research was developed from Fisher’s (1925) idea of “significance testing” and Pearson’s (1922) notion of “null hypothesis testing”. This simply means that the difference between two groups is either insignificant (if you repeat the experiment you are likely to obtain different results, consistent with the null hypothesis) or significant (if you repeat the experiment you are likely to obtain the same results, rejecting the null hypothesis). A difference is accepted as real only when the null hypothesis is rejected.
By definition, the null hypothesis assumes that any difference observed in a set of data is due to chance, and that no statistical significance exists in the given observations (the observed difference is effectively zero). In scientific experimentation two types of error can occur (a brief simulation follows the list):
- Type I: the null hypothesis is falsely rejected, giving a “false positive”.
- Type II: the null hypothesis fails to be rejected and an actual difference between populations is missed, giving a “false negative”.
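These two error types can be illustrated with a short simulation. The sketch below (an assumption of this chapter’s editing, using Python with NumPy and SciPy) repeatedly compares two samples drawn from the same population, so every rejection of the null hypothesis is a false positive; the observed rejection rate estimates the Type I error rate at α = 0.05.

```python
# A minimal sketch (assumes Python with NumPy and SciPy installed):
# estimate the Type I error rate by repeatedly testing two samples
# drawn from the SAME population, so every rejection is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
false_positives = 0
n_experiments = 5_000

for _ in range(n_experiments):
    group_a = rng.normal(loc=100, scale=15, size=30)  # same population
    group_b = rng.normal(loc=100, scale=15, size=30)  # same population
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value <= alpha:          # null hypothesis falsely rejected
        false_positives += 1

print(f"Observed Type I error rate: {false_positives / n_experiments:.3f}")
# Expected to be close to alpha (about 0.05).
```

A Type II error rate could be estimated the same way by drawing the two samples from populations with genuinely different means and counting how often the test fails to reject.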
Some believe that too much emphasis has been placed on the null hypothesis, while other investigators believe that it is not the existence of a difference between groups that is important but rather the size of the difference. Still others believe that focusing on effect sizes is not sufficiently broad to cover all aspects of data analysis. Nevertheless, at this particular time, the null hypothesis is the dominant model in scientific research.
In keeping with this concept, it is logical to relate the scientific method to three important principles of critical thinking: empiricism, the use of empirical evidence; skepticism, the avoidance of dogmatism and openness to challenges of the hypothesis; and rationalism, the use of logical reasoning.
The structural components of the scientific method therefore are as follows:
1. Construction of a question, which may arise spontaneously from an observation, from prior knowledge, or from identification of a gap in a practical or theoretical aspect of knowledge.
2. Information gathering, to ensure that answering the question will solve the problem, or to establish a problem statement based on the available information.
3. Proposal of a solution to the problem or an answer to the question using abstract or logical thinking. This is called the Scientific Hypothesis. A hypothesis is an informed, testable and predictive solution to a scientific problem.
4. Testing the hypothesis by experimentation or further observation using logic or empirical evidence. Experimentation is the most important part of the scientific method and has to be tangible and measurable. If a hypothesis cannot be tested by experimentation, it is not a valid scientific hypothesis.
5. Establishment of scientific facts. The results of scientific experimentation provide solutions to the problem or answers to the question and end with a conclusion: a statement of the truth based on a scientific hypothesis tested by scientific experimentation. The conclusion expresses a specific degree of confidence, usually 95% or more.
6. Corroboration. This is when several pieces of evidence, gathered by independent authorities in the field, are in agreement. In science, when a hypothesis is tested, we take data from several authorities and make sure ample reliable evidence is provided that it would be irrational to deny. This type of repeated experimentation with similar results enters the mainstream knowledge of the truth. This is called evidence-based knowledge, and it too can be challenged under different circumstances. Science is always open to new evidence.
Experimental Design #
The concept of causality was described by David Hume (1711–1776), and several methods for establishing cause-effect relationships were later proposed by John Stuart Mill (1806–1873). The randomized controlled trial is based on Mill’s method of difference, which relies on isolating the effect of a single variable controlled by the investigator, the “independent variable”, while the other variables are observed, the “dependent variables”. Measuring the outcome of the dependent variables while manipulating the independent variable shows whether the change in the dependent variables is caused by the independent variable.
Structure of Experimental design #
Experimental design is the process of designing a study to meet specified objectives and to support valid inferences. In order to answer the research questions, one should consider the following two important issues when planning an experiment:
- Ensure that the design provides the right type of data to be collected.
- Ensure that sufficient sample size and power are available.
The steps of experimental design are the same as the components of the scientific method described previously, and include the research question, review of the background literature, formulation of the hypothesis, objectives, conduct of experiments, data collection, data analysis, making assumptions, conclusions and corroboration. At this point it is important to discuss some of the fundamental issues in the context of experimental design that may affect outcomes. These issues mainly revolve around sample size calculation, experimental units, variables, treatment and design structure, as well as some basic statistical approaches to scientific research.
Experimental Units
Experimental units (EUs) are the individuals or objects under investigation by the researcher. Depending on the objective of the study, the sampling unit can be students, teachers, or groups of students or teachers. The unit could also be animals, patients or groups of patients. It is important to note that sampling/experimental units are the smallest units on which data are collected. In order to know how many experimental units are required for a particular experiment, it is essential to perform a sample size calculation.
Sample Size Calculation #
The first practical step in the design of an experiment is to determine how big the experimental groups should be, or how large a sample is required. In order to determine this value, it is important to keep in mind three questions:
- How accurate should the answer be?
- What level of confidence is required for the experiment?
- Is there any prior knowledge in an area similar or close to the subject of the experiment?
If the answers to these questions are available and the purpose of the experiment is to estimate a proportion, the following formula is used:
ME = z √( P(1 − P) / n )
Where:
ME is the desired margin of error
z is the z score,
P is the prior judgment and
n is the sample size
The investigator or experimental designer determines the desired margin of error, usually a value between 1 and 4%. The z-score is a statistical measurement of scores under a normal curve in relation to the mean, and it links the estimate to a confidence interval. A z-score of 1.96 corresponds to a 95% confidence level, which is often used for experiments in this area.
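Rearranging the formula for n gives n = z²·P(1 − P)/ME². The following sketch is illustrative only (Python; the 95% confidence level, prior proportion of 0.5 and 4% margin of error are assumed example values).

```python
# Minimal sketch: sample size for estimating a proportion,
# from n = z^2 * P * (1 - P) / ME^2 (rearranged margin-of-error formula).
import math

def sample_size_for_proportion(margin_of_error: float,
                               confidence_z: float = 1.96,
                               prior_p: float = 0.5) -> int:
    """Return the required sample size, rounded up."""
    n = (confidence_z ** 2) * prior_p * (1 - prior_p) / margin_of_error ** 2
    return math.ceil(n)

# Example: 4% margin of error, 95% confidence, no strong prior (P = 0.5)
print(sample_size_for_proportion(0.04))  # -> 601
```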
When the quantity being estimated is a mean rather than a proportion, or when the standard deviation is unknown or the sample size is less than 30, the following formula with a t-score is used instead of a z-score for the calculation of sample size.
This formula is:
ME = t S / √n
Where:
ME is the desired margin of error
t is the score used to calculate the confidence interval, which depends on both the degrees of freedom and the desired confidence level
S is the standard deviation
n is the sample size.
This formula is used when n is less than 30 or the standard deviation is unknown. For large samples the value of t is very close to the value of z, which is why 1.96 is often used for a 95% level of confidence in both formulas; for small samples, t should be taken from the t-distribution with the appropriate degrees of freedom. Since the standard deviation is not known, the value of S is estimated from prior knowledge or a previous experiment, or simply guessed.
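A similar rearrangement gives n = (t·S/ME)². Because t itself depends on n through the degrees of freedom, the calculation is usually iterated, as in the rough sketch below (Python with SciPy; the margin of error and the guessed standard deviation are illustrative assumptions).

```python
# Minimal sketch: iterate n = (t * S / ME)^2, since t depends on n
# through its degrees of freedom (df = n - 1).
import math
from scipy import stats

def sample_size_for_mean(margin_of_error: float,
                         stdev_guess: float,
                         confidence: float = 0.95) -> int:
    n = 30                                   # initial guess
    for _ in range(100):                     # iterate until stable
        t = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
        new_n = math.ceil((t * stdev_guess / margin_of_error) ** 2)
        if new_n == n:
            break
        n = new_n
    return n

# Example: estimate a mean to within +/- 2 units, guessing S = 10
print(sample_size_for_mean(margin_of_error=2, stdev_guess=10))
```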
Note: Additional information on the calculation of sample size is provided in the reading references in this chapter.
Variables
By definition, a variable is a characteristic that may assume more than one of a set of values to which a numerical measure can be assigned. In any experimental design, four types of variables are taken into consideration (each is described in turn below):
- Primary Variables or Variables of Interest
- Constant Variables
- Background Variables
- Uncontrollable / Hard to Change Variables
Primary Variables (Variables of Interest) – These are independent variables, also called factors, which form the treatment plan in the experimental design and serve as the factors that cause the effect and possible variations in response.
Constant Variables – Variables that are not part of the treatment but can affect the experiment and can be controlled during experimentation. These variables may include the use of equipment, standard procedures and operators, measuring devices, time, location and others.
Background Variables – These are not variables of interest or treatment but are present by default and may influence the outcome of the experiment. The important characteristic of background variables is that they are measurable but cannot be controlled. Background variables are treated as covariates in experimental design, and statistical methods of covariate analysis are used to remove their effect.
Uncontrollable / Hard-to-Change Variables – These are like background variables in that they are not variables of interest but are able to affect the variables of interest and subsequently influence the outcome and conclusion of the experiment. A characteristic of these variables is that certain conditions prevent them from being measured, controlled or manipulated.
Treatment Structure #
Treatment structure must include the factors the researcher is interested in (also called independent variables or primary variables), which are directly related to the primary objective of the study and form the conclusion of the study. In the treatment strategy, one can study a single factor with a single effect, a single factor with multiple effects, multiple factors with a single effect, or multiple factors with multiple effects. It is also important to define the levels of the factor of interest. There can be a range or variety of levels (also called subsets) in a treatment factor, which must also be taken into consideration in the experimental design. Treatment factors can be fixed or random. A factor is fixed when it has a small number of defined levels that are considered in the experimental design, such as gender (male or female). A range of levels may also exist in a treatment factor, but for the purposes of the experiment specific levels are chosen (such as specific years or the age of a person). If the levels of treatment are not defined and the levels are chosen randomly, it is called a random treatment factor.
When a combination of two or more factors or levels is considered in the experimental design, it is important to be sure they are logical and may have a compound effect, such as synergism, antagonism and others. Sometimes the treatment factor is repeated several times in the same experimental unit; this is referred to as replication. Replication should not be counted as an additional experimental unit. Replication is important for reducing errors in an experiment, and is different from repeated measures. Repeated measures are when the effect of the same factor is repeatedly measured on the same subject at specific time intervals; in this case each set of measurements is treated as a new set of data.
In considering the treatment structure, it is important to keep in mind the following assumptions:
- Since the independent variable always affects the dependent variable, the experiment has only one direction, from Independent ⟹ Dependent, not the other way around.
- Since only experimental variables are systematically manipulated, it is important to make sure that other explanations (such as Background Variables, Constant Variables, Uncontrollable Variables) for the difference are eliminated.
Design Structure #
In an experimental design, experimental units are allocated to a treatment factor randomly. Biases are inherent in every aspect of experimental design; therefore it is important not to try to eliminate all biases but to understand which biases will be acceptable in each particular circumstance. Randomization, essentially meaning the random allocation of experimental units, is important for reducing the incidence of bias.
If a specific constraint is introduced to randomization, this is called block design. For example, experimental units such as students could be assigned to a treatment randomly, or students can be divided into groups, or blocks, by characteristics that make them homogeneous. The former is called a completely randomized design, where units are assigned to treatments randomly, and the latter is called a randomized complete block design, where the blocks are formed first and then randomly allocated to the treatment factors. Block design provides stratification of the experimental units with similar values and consideration of the effect within each group. This means that the groups do not differ systematically from each other on any other variables that might cause a difference in the outcome.
If a particular variable is measured after the groups are formed, it is called a covariate, and the effect of this covariate can be removed using statistical analysis of covariates. Therefore the only effect left will be caused by the independent variable.
No matter how these methods are used to eliminate extraneous variables and control the effect of independent variables, it is impossible to eliminate all alternative variables. It is therefore advisable to randomize subjects in the formation of experimental groups and then randomize the groups/blocks. Randomization will not remove or equalize the effects of the alternative variables in any single experiment, but it will reduce bias in the experimental design.
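As an illustration of the difference between completely randomized and randomized block assignment, the hypothetical sketch below (Python; the student names and block labels are invented for the example) allocates subjects to two treatments, first completely at random and then within blocks.

```python
# Minimal sketch: completely randomized vs. randomized block assignment.
# Subject names, blocks and treatments are invented purely for illustration.
import random

random.seed(42)
subjects = [f"student_{i}" for i in range(1, 13)]
blocks = {s: ("junior" if i < 6 else "senior")            # blocking variable
          for i, s in enumerate(subjects)}
treatments = ["simulation", "lecture"]

# Completely randomized design: shuffle everyone, alternate assignment.
shuffled = subjects[:]
random.shuffle(shuffled)
crd = {s: treatments[i % 2] for i, s in enumerate(shuffled)}

# Randomized complete block design: randomize separately inside each block.
rcbd = {}
for block in set(blocks.values()):
    members = [s for s in subjects if blocks[s] == block]
    random.shuffle(members)
    for i, s in enumerate(members):
        rcbd[s] = treatments[i % 2]

print("Completely randomized:", crd)
print("Randomized block:     ", rcbd)
```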
Data Analysis Summary #
The first and most important issue in data analysis is to develop, at the time of experimental design, specific documents such as worksheets, data collection sheets and tables that reflect the data required by your objectives and conclusions. This will help to properly document the collected data in relation to the results and outcome of experimentation. The data can be placed or transferred into a database with a particular statistical package, or simply into an Excel sheet, for statistical analysis. The type of statistical analysis required is directly related to your research question, objectives, hypothesis and inferences.
The term statistical significance is complex. In statistics, a black and white (yes or no) answer to a research question is extremely difficult. Most of the time statistics offer a level of significance reflected as a P-value, the probability of obtaining the observed result, or one more extreme, if the null hypothesis were true. If the experiment is designed for a confidence level of 95%, the threshold P-value is ≤ 0.05; for 99%, it is ≤ 0.01; for 99.9%, it is ≤ 0.001. The P-value is not an indicator of the size or importance of the observed effect; it only indicates how incompatible the data are with the null hypothesis.
Data analysis has two main steps. The first step is descriptive statistics, which assess whether the data obtained by the experiment meet all the requirements of the experimental design. Descriptive statistics can be used to summarize population data. Numerical descriptors include the mean and standard deviation for continuous data, and frequency and percentage for categorical data. A normal distribution is a bell-shaped curve with equal distribution on both the left and right sides of the mean. The normal distribution is a mathematical model describing the probabilities of values obtained from repeated measurements of a specific variable. For example, a normal distribution would tell us how many students did extremely well or extremely poorly on an examination in relation to the mean, median or mode, as charted against the majority of the students’ scores. Although researchers usually assume that their data will follow a normal, bell-shaped distribution, this is not always the case. It is therefore important to look for the following abnormalities in the distribution of data.
First, a distribution can be skewed to the left or right. When the mean exceeds the median, the curve is positively skewed to the right; in our example this means that most students scored below the mean, with a few very high scores pulling the mean upward. When the mean is less than the median, the curve is negatively skewed to the left; in our example this indicates that most students scored above the mean, with a few very low scores pulling the mean downward.
The shape of the bell curve can also vary: it can be flat, or slim with a high peak. This property is called kurtosis. Flat curves are called platykurtic and tall, slim curves are called leptokurtic, in relation to the mesokurtic curve of a normal distribution. The curve appears flat when there is a large amount of variability in the measurement, with a large standard deviation from the mean. Slim, peaked curves occur when there is little variability in the measurement and the standard deviation is very small. The area under the normal curve on either side of the mean is measured in units of standard deviation, also referred to as the z-score. In a normal distribution, the mean has a z-score of 0 and the standard deviation is 1.0. The area under the curve between the mean and one standard deviation is 34.13% on each side; between one and two standard deviations it is 13.59%; and between two and three standard deviations it is 2.15%. Calculating these areas with z-scores provides an indication of the normality of the curve. Z-scores are also used to calculate sample sizes for experiments.
Bell curves are also used in education to determine students’ percentile scores. If the mean is taken as the 50th percentile, the distribution on either side of the mean indicates where a particular student sits in relation to the mean score of the other students. In educational circles, the normal distribution is also used to obtain other scores, such as T-scores.
A scatter plot (scatter graph) is another way to assess the distribution of experimental data, in which individual data points are plotted in the area between the X and Y axes.
Statistical analysis of experimental data is mainly about the relationship between dependent variables acquired as a result of the effect of the independent variables. The cause-effect relationship is established not only by observing the effects in a treatment group, but also by comparing the effects with an identical group where all of the conditions of the experiment are met without the treatment (control group). Therefore, the second step of data analysis is called Inferential Statistics. Inferential statistics is used to draw meaningful conclusions about the entire population. These inferences may include a simple yes or no answer to the scientific questions and hypothesis testing, estimated numerical characteristics of the data, describing relationships within the data as well as estimates of forecasting and predictions.
The most common inferential statistical analysis for the comparison of two independent groups is the unpaired t-test, or Student’s t-test. If the subject of investigation is a single group with measurements taken before and after the treatment, the statistical analysis used is the paired t-test. If the treatment is applied to several (two or more) independent groups, the statistical analysis required is ANOVA (analysis of variance), a test that shows whether the means of several groups are equal or whether there are differences among the group means. This is like a t-test for more than two groups, and it partitions the variation within and between the groups.
For a single factor, a one-way ANOVA studies the effect of that factor across different groups. A two-way or multi-factor ANOVA analyzes the effects of multiple factors on several groups. When the experiment includes observations at a variety of levels of the same factors, the design is called factorial. In cases where the effects of treatment are observed in one group but measured at several intervals, the appropriate test is a repeated-measures ANOVA.
In complex conditions, when the effects of several treatments are observed in several groups or at several time intervals and the researcher wishes to identify cross-factor effects among all treatments and groups, ANOVA with orthogonal contrasts is applied.
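The sketch below (Python with SciPy; the score lists are made-up illustrative data) shows how the three basic comparisons described above map onto standard library calls.

```python
# Minimal sketch: unpaired t-test, paired t-test and one-way ANOVA in SciPy.
# The score lists are invented illustrative data.
from scipy import stats

control   = [62, 70, 68, 75, 71, 66]
simulated = [74, 79, 72, 81, 77, 75]
group_c   = [69, 73, 70, 76, 72, 71]

# Two independent groups -> unpaired (Student's) t-test
t_unpaired, p_unpaired = stats.ttest_ind(control, simulated)

# One group measured before and after treatment -> paired t-test
before = [55, 61, 58, 64, 60, 57]
after  = [63, 66, 65, 70, 68, 62]
t_paired, p_paired = stats.ttest_rel(before, after)

# Three or more independent groups -> one-way ANOVA
f_stat, p_anova = stats.f_oneway(control, simulated, group_c)

print(f"Unpaired t-test p = {p_unpaired:.4f}")
print(f"Paired t-test   p = {p_paired:.4f}")
print(f"One-way ANOVA   p = {p_anova:.4f}")
```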
Correlation is dependence between two random variables or two sets of data, and is used to indicate a predictive relationship between variables without implying causality. In practice the degree of correlation is important and is calculated by Pearson’s correlation coefficient, which quantifies the strength of the relationship between the variables.
Regression analysis is a statistical process for estimating the relationships between a dependent variable and one or more independent variables, describing how the value of the dependent variable changes when one independent variable is varied while the others are held fixed. Regression analysis may be used to predict or forecast outcomes, but one must be extremely careful when expressing this type of relationship as a cause-effect relationship. The major difference between correlation and regression is that correlation is concerned only with the relationship between the variables (X and Y), and it does not matter which is X and which is Y. Regression, on the other hand, is concerned with the effect of an independent variable on a dependent variable as a one-way street: it matters where X and Y are located. Correlation cannot speak to causality; regression is directional and is often used to model presumed cause-effect relationships, but causality must still be established by the design of the study.
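The contrast can be seen in a short sketch (Python with SciPy; the paired measurements are invented): Pearson’s r is symmetric in X and Y, while the regression slope changes when the roles of X and Y are swapped.

```python
# Minimal sketch: correlation is symmetric in X and Y; regression is not.
# The paired measurements below are invented illustrative data.
from scipy import stats

hours_practised = [2, 4, 5, 7, 8, 10, 12]
checklist_score = [55, 60, 62, 70, 74, 78, 85]

r_xy, _ = stats.pearsonr(hours_practised, checklist_score)
r_yx, _ = stats.pearsonr(checklist_score, hours_practised)
print(f"Pearson r (X,Y) = {r_xy:.3f},  (Y,X) = {r_yx:.3f}")   # identical

fit_xy = stats.linregress(hours_practised, checklist_score)
fit_yx = stats.linregress(checklist_score, hours_practised)
print(f"Regression of Y on X: slope = {fit_xy.slope:.3f}")
print(f"Regression of X on Y: slope = {fit_yx.slope:.3f}")    # different
```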
When the distribution of data is skewed, when the number of experimental units is small, or when the effects of the treatment are measured not on an interval scale but with ordinal or categorical data, nonparametric statistics are applied. In nonparametric statistics the probability distribution is not based on preconceived parameters, which is why nonparametric tests do not make assumptions about probability distributions; the quantities used in the calculations are not assumed in advance but are derived from the data themselves. Some useful nonparametric statistical analyses are listed below, followed by a brief code sketch:
Histograms – Show the probability distribution of the data in a graph without assuming a parametric form; Karl Pearson introduced the histogram.
Mann–Whitney U test – For the comparison of two samples coming from the same population with asymmetric distributions, when one population tends to have larger values than the other. It is essentially a t-test for nonparametric values; the median is used instead of the mean.
Cohen’s kappa – Measures inter-rater agreement for categorical items.
Friedman’s test – A nonparametric version of the two-way analysis of variance, determining whether k treatments in a randomized block design have identical effects.
Cochran’s Q test – Tests whether k treatments in a randomized block design with binary (0, 1) outcomes have identical effects. It is used for the analysis of two-way randomized block designs where the response variable can take only two possible values (0, 1).
Kruskal–Wallis test – A nonparametric version of the one-way ANOVA that tests whether two or more independent samples are drawn from the same distribution. It is an extension of the Mann–Whitney U test to more than two groups.
Kendall’s W test – Used for assessing agreement among raters, much like Pearson’s correlation coefficient but without requiring an assumed probability distribution; it can handle any number of distinct outcomes.
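Two of these tests are available directly in SciPy, as in the hedged sketch below (the ordinal ratings are invented; Cohen’s kappa and Kendall’s W need additional packages and are not shown).

```python
# Minimal sketch: two common nonparametric tests in SciPy.
# The ratings below are invented ordinal data (e.g. 5-point Likert scores).
from scipy import stats

group_a = [3, 4, 2, 5, 4, 3, 4]
group_b = [2, 3, 2, 3, 1, 2, 3]
group_c = [4, 5, 5, 4, 3, 5, 4]

# Mann-Whitney U test: two independent samples, no normality assumption
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Kruskal-Wallis test: extension of Mann-Whitney U to three or more groups
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

print(f"Mann-Whitney U p = {p_u:.4f}")
print(f"Kruskal-Wallis p = {p_kw:.4f}")
```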
NOTE: This chapter is an introduction to the scientific method and experimental design, and certainly does not cover all aspects of research. The intent is to provide a general and basic understanding of experimental design and the scientific method for beginners. For more information on this topic, one can take specific statistics courses and read the other available literature.
Data Analysis
The science of statistics deals with numbers and figures, which play a significant role in any statistical project. Statisticians make decisions and predictions based on evidence called data, so the foundation of statistics is data. Data are information, represented by numbers or words, that expand our knowledge so that we can decide on or interpret a problem. Statistics indicates how to collect and organize data and describes how to analyze and interpret them. In the beginning we have raw data, collected from a survey or a measurement. These data are not particularly useful by themselves, but statistics enables us to draw useful information from them and understand how they behave. This is the part of statistics that enables us to interpret the data for purposes such as making decisions, testing hypotheses, making predictions and others. Indeed, we apply statistics to legitimize a claim, test the efficiency of a new product or make a prediction. Statistics is usually divided into two categories.
- Descriptive statistics, which cover collecting data, organizing them in graphs or tables, and summarizing them into numbers (mean, median, mode).
- Inferential statistics, which build on descriptive statistics to interpret data and make predictions.
Descriptive Statistics
Data analysis has two main steps. The first step is descriptive statistics, which assess whether the data obtained by the experiment meet all the requirements of the experimental design. Descriptive statistics can be used to summarize the population data.
Frequency Distributions and Graphs
After gathering data for a specific variable, the next step is to organize the raw data by constructing a frequency distribution. Sometimes it is more convenient to represent the data visually with graphs, which enable us to see the trends or patterns in the data at a glance. We can demonstrate data by generating statistical charts and graphs such as histograms, pie charts, stem-and-leaf plots or frequency tables.
The two types of frequency distribution used most often are the categorical frequency distribution and the grouped frequency distribution.
Frequency distribution table
When the range of the data is large, the data must be grouped into classes that are more than one unit in width. We construct a frequency table for the data and classify the values into classes.
For each class, we have a lower and an upper class limit, which are where the class starts and ends.
The class width is the difference between the lower (or upper) class limit of one class and the lower (or upper) class limit of the previous or next class. We then determine the number of classes.
Preferably there should be between 5 and 20 classes (not too few groups nor too many). A common recommendation is to find the smallest k such that 2^k > n, where k is the number of classes and n is the sample size.
H = highest value, L = lowest value, R = H − L (the range), Width = R/k
Example 1: The following data represents the number of children in a random sample of 50 rural Canadian families.
(Reference: American Journal of Sociology, Vol. 53, 470–480)
11 | 2 | 9 | 3 | 9 | 6 | 9 | 5 | 14 | 5 |
13 | 5 | 2 | 3 | 4 | 0 | 5 | 2 | 7 | 3 |
4 | 0 | 5 | 4 | 3 | 2 | 4 | 2 | 6 | 4 |
14 | 0 | 2 | 7 | 3 | 6 | 3 | 3 | 6 | 6 |
10 | 3 | 3 | 1 | 2 | 5 | 2 | 5 | 2 | 1 |
The raw data above are not very informative, so we organize them into a frequency distribution table with the following SPSS procedure.
First, we need R, which can be computed in SPSS with the following procedure:
Analyze → Descriptive Statistics → Descriptives (move the variable into the right box) → Option (check Range, Maximum, and Minimum) → Continue → Ok
Output:
Descriptive Statistics
| | N | Range | Minimum | Maximum |
|---|---|---|---|---|
| Number of children | 50 | 14.00 | .00 | 14.00 |
| Valid N (listwise) | 50 | | | |
With the recommended formula for k (2^k > n), the smallest k such that 2^k > 50 is 6 (2^6 = 64 > 50). Therefore we have R = 14, n = 50 and k = 6, so width = 14/6 ≈ 2.3, which we round up to a width of 3 (since the data are discrete we cannot use a decimal width). Afterward we use the following SPSS procedure to find the frequency table.
- Define your variables in Variable View window.
- Enter your data in the Data View window.
- Transform → Visual Binning (move your variables to Variables to Bin) → Continue → (choose a new name for your variable in Binned Variable, choose the Upper Endpoints option) → Make Cutpoints (First Cutpoint Location = L or L−1, Number of Cutpoints = number of classes, then Width, which is filled automatically) → Apply → Make Labels → Ok
- Analyze →Descriptive Statistics →Frequencies (move the new variable into Variables box) → Charts→ Histogram → Continue.
(In this example, the First Cutpoint Location = −1, the Number of Cutpoints = 6, and the Width = 3.)
Output:
Number of children (Binned)
| | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| .00 – 2.00 | 14 | 28.0 | 28.0 | 28.0 |
| 3.00 – 5.00 | 21 | 42.0 | 42.0 | 70.0 |
| 6.00 – 8.00 | 7 | 14.0 | 14.0 | 84.0 |
| 9.00 – 11.00 | 5 | 10.0 | 10.0 | 94.0 |
| 12.00 – 14.00 | 3 | 6.0 | 6.0 | 100.0 |
| Total | 50 | 100.0 | 100.0 | |
The table above is more informative and improves our understanding of the data. For example, it enables us to state that 6% of the families have 12 or more children.
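The same grouped frequency table can be reproduced outside SPSS; the sketch below (Python with pandas, using the 50 values from Example 1) bins the data with the class width of 3 computed above.

```python
# Minimal sketch (Python with pandas): grouped frequency table for Example 1,
# using the class width of 3 computed above (classes 0-2, 3-5, ..., 12-14).
import pandas as pd

children = [11, 2, 9, 3, 9, 6, 9, 5, 14, 5,
            13, 5, 2, 3, 4, 0, 5, 2, 7, 3,
            4, 0, 5, 4, 3, 2, 4, 2, 6, 4,
            14, 0, 2, 7, 3, 6, 3, 3, 6, 6,
            10, 3, 3, 1, 2, 5, 2, 5, 2, 1]

bins = [-1, 2, 5, 8, 11, 14]                     # upper endpoints of each class
labels = ["0-2", "3-5", "6-8", "9-11", "12-14"]
binned = pd.cut(pd.Series(children), bins=bins, labels=labels)

table = binned.value_counts().sort_index().to_frame("Frequency")
table["Percent"] = 100 * table["Frequency"] / len(children)
table["Cumulative Percent"] = table["Percent"].cumsum()
print(table)
```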
We can also present the data in visual or graphical form, which gives a quick idea of the pattern of the data. The most commonly used statistical graphs include: bar charts (mostly used for categorical data), pie charts (mostly used for categorical data, to show the percentage or proportion of each category), histograms (mostly used for quantitative continuous data), stem-and-leaf plots (used to show how the data are spread), and frequency polygons (used to show the shape of the distribution of the data and to diagnose skewness and kurtosis).
For Example 1, we generate some of these graphs with the following SPSS procedure.
SPSS:
Analyze →Descriptive Statistics →Frequencies (move the variable into Variables box) → Charts→ Bar charts→ Continue.
A quick glance at this bar chart shows that most of the families have 2 or 3 children and that fewer families have more than 10 children. The distribution of the data is not symmetrical, since more of the data lie on the left side of the chart. With the following procedure, we generate the frequency polygon, which makes the shape of the data even easier to understand.
Graphs → Chart Builder → Line (drag the simple one into the top (chart preview uses example data box)) → drag the variable into the X-Axis box → Ok
Output: (frequency polygon chart)
The frequency polygon above gives quick information about the shape of the distribution of the data in terms of symmetry and flatness.
The data above are discrete, although for continuous data we can use the same SPSS procedure. However, to eliminate the gap between the upper limit of one class and the lower limit of the next class there is one more step, which can be found in statistics textbooks.
Frequency distribution table (categorical data)
This type of table is used for nominal or ordinal data, such as blood types, gender or university rankings. In this case we deal with letters or words, so in order to draw information from them and make them easier to work with, we assign them numeric values.
Example 2: Suppose we study 50 patients with heart disease and the following fictitious data are their blood types. We cannot interpret or obtain information from the raw values directly, so we organize them in the following table.
AB O B O O B B A A B AB B B O O O O B B O AB A AB B B O O O AB A B B AB O O O B AB B AB O O O A B B AB O B
We can use SPSS to construct a frequency distribution table. First, in Variable View, define the variable, which is “Bloodtype”. Then, in the Values column, define a numeric value for each category: for example, we assign type A the value “1”, type B the value “2”, type O the value “3”, and type AB the value “4”. For instance, in the Value box type 1 and in the Label box type A, so that SPSS treats 1 as blood type A. Afterward, in the Type column check String, since the data are nominal. Finally, enter the data in the Data View window.
SPSS:
Analyze →Descriptive Statistics →Frequencies (move the variable into Variables box) → Charts→ Bar charts or Pie charts→ Continue.
Output:
| Bloodtype | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|
| A | 5 | 10.0 | 10.0 | 10.0 |
| B | 18 | 36.0 | 36.0 | 46.0 |
| O | 18 | 36.0 | 36.0 | 82.0 |
| AB | 9 | 18.0 | 18.0 | 100.0 |
| Total | 50 | 100.0 | 100.0 | |
The frequency table and bar chart above give us more information than the raw data. For example, the table indicates that 36% of the patients have type O blood and another 36% have type B.
Histogram
The histogram is one of the most common statistical charts (diagrams) used to represent the distribution of a variable. The pattern of a dataset represented by a histogram can be symmetric, skewed right or skewed left. We represent a continuous dataset with a histogram and a discrete dataset with a bar chart, and we can generate statistical charts in SPSS with the following procedure:
Graphs → Legacy Dialogs
We can summarize the data even further into one specific number that represents the information, such as the average of the values. This number, the sample mean, is influenced by all the data. More generally, we try to describe a dataset with single values called measures of central tendency. Three such measures specify the center of the data distribution, the halfway point of the dataset, and the most frequent value. Indeed, these measures are often more useful than a frequency table or even a graph, especially when we want to compare two or more groups or populations on a variable: we cannot compare the entire data of each population or sample, but we can compare their representative values and generalize the answer to the entire population. This enables us to make decisions much faster and at less expense.
Measures of Central Tendency
The mean (x̄), median (MD) and mode (MO) are the three main measures of central tendency that can approximately describe a whole sample or population. These measures are used to indicate the center of the data distribution. When the measures describe the characteristics of a population they are called parameters; when they describe the characteristics of a sample they are called statistics.
The Mean
The most important and most frequently used measure of central tendency is the mean. In other words, the mean is the average of a dataset, and it is reliable because every data value influences it. Suppose you run a business in which each day has a different profit. If you wish to describe your income for a week, it does not make sense to mention each day’s profit separately; instead of reporting the 7 numbers, you may represent your weekly income by their average. That gives everybody an idea of your weekly income, and it also allows you to compare your income from week to week. Recall Example 1, with 50 data values that were not informative before constructing the frequency table. If we wished to compare the data on these 50 families with, for instance, their US counterparts, we could not compare all the data values; instead we compare their representatives, which are the means. The mean for these data is 4.7 ≈ 5 (since the data are discrete and 4.7 children is meaningless). That means the average number of children for these 50 families is about 5; in other words, 5 summarizes the 50 values. Data values with higher frequency and larger magnitude have more influence on the mean. The following formulas are used to calculate the mean.
Sample mean: x̄ = (x1 + x2 + x3 + … + xn)/n = Σx/n, where n is the sample size
Population mean: μ = (X1 + X2 + X3 + … + XN)/N = ΣX/N, where N is the population size
The population mean is usually an unknown parameter and is estimated by the sample mean (a point estimate). The formula above is easy to calculate, and the mean is a reliable representative of the dataset since it uses all the data values. The mean does, however, have some disadvantages. Because it is affected by every value, a very large or very small value causes the mean to increase or decrease accordingly. This means that if there is an extreme value (an outlier) in the dataset, the mean is higher or lower than usual; in this case the mean is not a proper representative of the data and we cannot trust it. Therefore, when the dataset contains outliers or is skewed, we prefer to use the median. The mean is not used for qualitative data.
The Median
The median is the midpoint of the data and is denoted MD. In other words, exactly half of the data values are greater than the median and half are less than the median. Recalling Example 1 again, the median for that dataset is 4: roughly fifty percent of families have fewer than 4 children and fifty percent have more than 4. This measure of central tendency is not sensitive to outliers, since the values of all the data are not used in its calculation. To calculate the median, first sort the data in ascending order. The middle value is the median if the number of values in the dataset is odd; if the number of values is even, the median is the average of the two middle values.
The Mode
The mode is the most frequently occurring value in the dataset, in other words the value with the highest frequency. The mode can be used when the data come from a nominal or ordinal variable. The mode is not necessarily unique for a dataset, which can be confusing: a dataset with two modes is called bimodal, one with more than two modes is called multimodal, and when all values occur with equal frequency the dataset has no mode. The mode can easily be found in a bar chart or histogram, where the highest bar represents the mode. We usually use the mode when we have qualitative data, since the mean and median are then meaningless.
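For Example 1, these three measures can be checked quickly in code; the sketch below (Python standard library only) reproduces the mean of 4.7 and the median of 4.

```python
# Minimal sketch: mean, median and mode for the Example 1 data,
# using only the Python standard library.
import statistics

children = [11, 2, 9, 3, 9, 6, 9, 5, 14, 5,
            13, 5, 2, 3, 4, 0, 5, 2, 7, 3,
            4, 0, 5, 4, 3, 2, 4, 2, 6, 4,
            14, 0, 2, 7, 3, 6, 3, 3, 6, 6,
            10, 3, 3, 1, 2, 5, 2, 5, 2, 1]

print("Mean:  ", statistics.mean(children))    # 4.7
print("Median:", statistics.median(children))  # 4
print("Mode:  ", statistics.mode(children))    # most frequent value
```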
Measures of Variability (dispersion)
When we have quantitative variables, in addition to measures of central tendency we must consider measures of dispersion. These measures indicate how much the data are scattered around the measures of central tendency; in other words, they indicate how well the measures of central tendency describe the dataset. Suppose a teacher wants to compare the math scores of two different groups of students.
The following data represents the scores for each group.
Group 1: 71, 72, 64, 78, 73, 69, 76, 72, 67, 70 (x̄1 = 71.2)
Group 2: 100, 73, 81, 68, 98, 91, 10, 100, 21, 70 (x̄2 = 71.2)
The mean for group 1 is equal to the mean for group 2, but this does not mean that both groups are at the same level of strength in math. For group 1, 71.2 is a proper representative, since almost all of the data values are close to the mean and to each other; the variability among the values in group 1 is not high, so the mean is reliable. On the other hand, the values in group 2 vary considerably from one to another, so the mean is not a good representative; there are two weak scores, 10 and 21. In conclusion, a dataset with more consistency has more reliable measures of central tendency, and that is the most important reason to calculate the variability of a dataset.
Range
The simplest measure to calculate is the range, the difference between the largest and smallest values, denoted R. For the example above, the ranges of the two groups are R1 = 14 and R2 = 90, respectively. The comparison indicates that the variation among the data values in group 2 is considerably higher than in group 1. In addition to being simple and quick, this measure can show whether there are outliers or odd values. However, since only two data values influence it, we look for other measures that use the entire dataset.
Standard Deviation
The most frequently used measure of variability is the standard deviation, which reflects the typical deviation of the data values from the mean. This measure shows the spread between the data values more precisely, since it takes the entire dataset into account. The value of the standard deviation depends on the variability within the dataset: the more variability or spread between the data, the higher the standard deviation, and vice versa.
Sample standard deviation: s = √( Σ(x − x̄)² / (n − 1) )
Population standard deviation: σ = √( Σ(X − μ)² / N )
Variance
The variance is the square of the standard deviation; it is therefore denoted s² for a sample and σ² for a population.
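The distinction between the sample (n − 1) and population (N) formulas shows up directly in code; the sketch below (Python with NumPy, using the two groups of math scores above) also reproduces the two ranges R1 = 14 and R2 = 90.

```python
# Minimal sketch: range, sample standard deviation (ddof=1) and variance
# for the two groups of math scores above.
import numpy as np

group1 = np.array([71, 72, 64, 78, 73, 69, 76, 72, 67, 70])
group2 = np.array([100, 73, 81, 68, 98, 91, 10, 100, 21, 70])

for name, g in [("Group 1", group1), ("Group 2", group2)]:
    r = g.max() - g.min()                 # range
    s = g.std(ddof=1)                     # sample standard deviation (n - 1)
    print(f"{name}: mean={g.mean():.1f}, range={r}, s={s:.2f}, s^2={s**2:.2f}")

# Note: g.std(ddof=0) would give the population formula (divide by N).
```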
SPSS:
We can simply compute these measures by SPSS.
Analyze→ Descriptive Statistics → Frequencies (move the variables) → Statistics (check central tendency or Dispersion) → Continue
Example 3: Suppose the following data indicate the fasting blood glucose for 35 randomly selected adults. We want to find the measures of central tendency and variability with SPSS.
94.82 85.06 80.16 90.31 100.99 108.95 85.77 79.96 91.62 86.05 89.20 81.16 84.68 90.42 97.59 79.59 88.81 95.34 90.10 89.17 104.24 101.76 93.55 101 71.97 92.14 106.90 78.52 82.53 84.06 91.65 75.58 107.69 94.28 83.52
First, we define our quantitative continuous variable, Blood Glucose Level, in the Variable View window; then we enter the data in the Data View window; finally, we use the procedure above. The following table shows the measures of central tendency and the measures of variability.
Output:
Statistics: Blood Glucose Level
| Statistic | Value |
|---|---|
| N (Valid) | 35 |
| N (Missing) | 0 |
| Mean | 90.2594 |
| Median | 90.0975 |
| Mode | 71.97 a |
| Std. Deviation | 9.33174 |
| Variance | 87.081 |
| Range | 36.98 |
| Minimum | 71.97 |
| Maximum | 108.95 |
| Sum | 3159.08 |

a. Multiple modes exist. The smallest value is shown.
There are two ways to generate the histogram and display the normal curve for the data.
- Graphs → Legacy Dialogs (choose histogram and move the variable to the right and you can check Display normal curve)
- Graphs → Chart Builder → Ok → select Histogram → double click on the simple histogram (the left one) → drag the variable (Blood Glucose Level) into the X-Axis → check the Display normal curve on the right side → Apply → Ok
Output: (histogram of Blood Glucose Level with the normal curve overlaid)
Skewness
As mentioned, statistical graphs and charts enable us to distinguish the pattern or shape of a distribution. Consider the following histograms of three different types of data distribution.
The shape of the data distribution in figure 3 is almost symmetric, as opposed to figures 1 and 2; in figure 3 the mean, median and mode are approximately equal or close. In figures 1 and 2, however, we see skewness either to the right or to the left, and the measures of central tendency are not close. In figure 1 the mode is the smallest of the three, since low values have the highest frequency; the mean is dragged toward the tail, and the median lies between them. Vice versa, in figure 2 the mode is higher than the mean and median, and the mean is the smallest measure, dragged toward the left. Therefore, when the data are skewed, the median is the best representative.
Frequency Distribution Shapes
The three most important shapes are positively skewed, symmetric, and negatively skewed.
- Positively Skewed (Right Skewed) Distribution (skewness is positive)
In this case the mean is the largest measure, followed by the median and then the mode. This means that the majority of the data values are smaller than the mean. In this shape, the tail is to the right (figure 1).
- Symmetric (skewness is zero)
In this distribution, the three measures of central tendency have almost the same values. This means the data values are distributed evenly on both sides of the mean (figure 3).
- Negatively Skewed (Left Skewed) Distribution (skewness is negative)
This is the opposite of the first case. The mean is the smallest measure, and the median and mode are respectively greater than the mean. This means that the majority of the data values are greater than the mean. In this shape, the tail is to the left (figure 2). Skewness is therefore a measure of asymmetry, and a value in the range (−1, 1) is commonly taken as acceptable for an approximately normal distribution.
Kurtosis
Kurtosis is a measure of the shape of a dataset that describes how peaked the distribution is. The curve can be normal, flat or peaked. For a normal curve the value of kurtosis is zero, for a flat curve it is negative, and for a peaked curve it is positive. An acceptable range for kurtosis is (−2, 2).
We can compute the skewness and kurtosis of the dataset by SPSS:
Analyze → Descriptive Statistics → Frequencies (move the variable to the right) → Statistics (check skewness and Kurtosis) → Continue → Ok
As the following output for Example 3 shows, the data are approximately symmetric.
Statistics: Blood Glucose Level
| Statistic | Value |
|---|---|
| N (Valid) | 35 |
| N (Missing) | 0 |
| Skewness | .284 |
| Std. Error of Skewness | .398 |
| Kurtosis | -.478 |
| Std. Error of Kurtosis | .778 |
The table above indicates that the skewness and kurtosis are within their acceptable ranges. We explain the z-values for skewness and kurtosis later, in order to test normality.
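Outside SPSS, the same two statistics can be computed as in the sketch below (Python with SciPy, using the Example 3 blood glucose values). Note that SciPy’s `kurtosis` reports excess kurtosis, so a normal curve gives 0, matching the convention used above; with bias correction the results should be close to the SPSS values.

```python
# Minimal sketch: skewness and (excess) kurtosis for the Example 3 data.
# With bias correction these should be close to the SPSS values above.
from scipy.stats import skew, kurtosis

glucose = [94.82, 85.06, 80.16, 90.31, 100.99, 108.95, 85.77, 79.96, 91.62,
           86.05, 89.20, 81.16, 84.68, 90.42, 97.59, 79.59, 88.81, 95.34,
           90.10, 89.17, 104.24, 101.76, 93.55, 101, 71.97, 92.14, 106.90,
           78.52, 82.53, 84.06, 91.65, 75.58, 107.69, 94.28, 83.52]

print("Skewness:", round(skew(glucose, bias=False), 3))
print("Kurtosis:", round(kurtosis(glucose, fisher=True, bias=False), 3))
```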
Measures of Position
These measures indicate the location of a data value relative to the other values when the data are arranged in ascending order.
- Percentiles
A percentile is the data value below which a specific percentage of the observations fall. Percentiles are denoted P1, P2, …, P99. For example, the 95th percentile is the value below which 95% of the ordered values lie; hence the median is the 50th percentile, below which half of the values lie. The percentiles therefore divide the frequency distribution into 100 equal groups. For Example 3, the percentiles can be found with the following procedure in SPSS.
Analyze → Descriptive Statistics → Frequencies (move the variable to the right box) → Statistics (check the percentiles you wish to calculate and Add) → Continue → Ok
Output:
Statistics: Blood Glucose Level
| Statistic | Value |
|---|---|
| N (Valid) | 35 |
| N (Missing) | 0 |
| 75th Percentile | 95.3398 |
| 95th Percentile | 107.9441 |
For example, we requested the 75th and 95th percentiles, which are 95.3398 and 107.9441 respectively. This means that 75% of the individuals have a blood glucose level below 95.3398; likewise, 95% of the individuals have a level below 107.9441.
- Deciles
The deciles divide the frequency distribution into 10 equal groups. They are related to the percentiles: the first decile is the 10th percentile, meaning that 10% of the values are smaller than the first decile. The deciles are denoted D1, D2, D3, …, D9 and correspond to P10, P20, P30, …, P90. Since SPSS does not calculate deciles directly, we calculate the corresponding percentiles (P10, P20, …, P90) instead.
- Quartiles
Q1, Q2 and Q3 divide the data into four equal groups (quarters) and are called the first, second and third quartiles. The first quartile (Q1) is the same as the 25th percentile; the second quartile (Q2) is the same as the 50th percentile, the median, or the fifth decile; and the third quartile (Q3) is the same as the 75th percentile.
SPSS:
Analyze→ Descriptive Statistics→ Frequencies→ Statistics (check Quartiles).
We run the procedure above for the example 3:
Output:
| Statistics: Blood Glucose Level | |
| --- | --- |
| N (Valid) | 35 |
| N (Missing) | 0 |
| 25th percentile (Q1) | 83.5247 |
| 50th percentile (Q2) | 90.0975 |
| 75th percentile (Q3) | 95.3398 |
IQR or Interquartile range
The IQR is the difference between Q3 and Q1 and is used to identify outliers. For example 3:
IQR=95.3398-83.5247=11.8151
Outliers
Outliers are extreme data values, either high or low compared with the other data values. There are two types: mild and extreme outliers. A data value is considered an extreme outlier when it falls outside the interval (Q1 − 3·IQR, Q3 + 3·IQR), whereas it is a mild outlier when it falls outside the interval (Q1 − 1.5·IQR, Q3 + 1.5·IQR). A boxplot also displays the quartiles and outliers; in SPSS, mild and extreme outliers are shown as circles and stars respectively. It is worth mentioning that we should not drop outliers without an appropriate reason, even though they can affect the accuracy of our results. If the data are normally distributed, we can also use the Z-score, explained below, to identify outliers: a data value whose Z-score is greater than 3 or less than −3 is labelled an outlier.
SPSS:
Graphs → Legacy Dialogs → Boxplot → Simple (check Summaries of separate variables), or
Analyze → Descriptive Statistics → Explore → first move the dependent variables into the Dependent List box, then move the independent or grouping variable (if there is one) into the Factor List box → Statistics → check Outliers → Continue → Ok
We run the procedures above for example 3:
As you can see, there are no outliers in the dataset.
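As a scripted counterpart to the SPSS steps above, the following R sketch computes the quartiles, the IQR, and the mild and extreme outlier fences described earlier; `glucose` is again a hypothetical vector.

```r
glucose <- c(88, 92, 79, 105, 84, 96, 71, 90, 101, 83)  # hypothetical values

q1  <- quantile(glucose, 0.25)
q3  <- quantile(glucose, 0.75)
iqr <- q3 - q1                                      # interquartile range

mild_fence    <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # mild outliers fall outside this
extreme_fence <- c(q1 - 3   * iqr, q3 + 3   * iqr)  # extreme outliers fall outside this

glucose[glucose < mild_fence[1] | glucose > mild_fence[2]]  # flag any outliers
boxplot(glucose)                                            # visual check, as in SPSS
```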
- Standard Scores (Z-Score)
A standard score indicates how many standard deviations a data value is from the mean. We can also use standard scores to compare two data values from different datasets or distributions. For example, suppose we have two separate groups of males and females who have heart disease. If we wish to compare a female patient's serum cholesterol with a male patient's serum cholesterol, the two values should first be standardized and then compared; this takes the other data values in each group, and their variation, into account. The following formulas are used for calculating Z-scores:
For a population: z = (x − μ)/σ
where z is the z-score, x is the data value, μ is the population mean, and σ is the population standard deviation.
For a sample: z = (x − x̄)/s
where z, x, x̄, and s are the z-score, the data value, the sample mean, and the sample standard deviation respectively.
Suppose a student wishes to compare his math score with his French score. If his score in math is 67 and in French is 69, he has to standardize the two scores before comparing them, since each score comes from a different distribution. In other words, we should not compare raw values from different datasets without considering the variation of the data; we cannot say the student did better in French just because 69 > 67. Suppose the mean and standard deviation of the math class are 69 and 10 respectively, and the mean and standard deviation of the French class are 75 and 11 respectively. The z-scores of the math and French scores are:
Math: z = (67 − 69)/10 = −0.2 and French: z = (69 − 75)/11 ≈ −0.55
Comparing the two z-scores, we can see that the student in fact did better in math, since −0.2 > −0.55.
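The comparison can be reproduced with a one-line standardization in R, using the class means and standard deviations given in the example:

```r
z_math   <- (67 - 69) / 10    # -0.20
z_french <- (69 - 75) / 11    # about -0.55

z_math > z_french             # TRUE: the math score is relatively better
```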
The Z-score is also used to identify outliers: if the standardized data value is less than −2.68 or greater than 2.68, it is labeled an outlier. In SPSS, the following procedure generates the Z-scores of a variable: Analyze → Descriptive Statistics → Descriptives (move the variable to the right box, check Save standardized values as variables) → Ok.
The Normal Distribution
The normal distribution is the most important statistical distribution: it is the underlying assumption for many statistical tests and the most frequently used distribution. Many human characteristics, such as weight, height, and blood pressure, are approximately normally distributed. We start with an example. According to Edward R. Laskowski, M.D., "A normal resting heart rate for adults ranges from 60 to 100 beats a minute" (http://www.mayoclinic.org/healthy-lifestyle/fitness/expert-answers/heart-rate/faq-20057979). Physicians therefore decide whether further investigation is needed for patients whose heart rate falls outside this range. Furthermore, the normal distribution enables us to calculate the probability of a heart rate being less than or greater than a value, or between two specific values. Suppose we measure the resting heart rate of 15 men in their thirties. The following data are fictitious.
69.99, 74.80, 73.48, 73.02, 64.79, 65.28, 73.25, 83.99, 73.97, 73.26, 73.13, 68.62, 82.85, 70.20, 72.54
First, we generate the histogram to determine the shape of data as well as measures of central tendency.
Output:
| Statistics: Heartbeat | |
| --- | --- |
| N (Valid) | 15 |
| N (Missing) | 0 |
| Mean | 72.8783 |
| Median | 73.1279 |
| Mode | 64.79 (a) |
| Std. Deviation | 5.24588 |

a. Multiple modes exist. The smallest value is shown.
The histogram is approximately symmetric and the mean and median are close. We increase the sample size to 50 people and generate the following histogram.
The histogram appears more symmetrical. This time we increase the sample size to 500 and generate the histogram again.
The figures indicate that the distribution becomes more symmetrical as the sample size grows, gradually resembling a bell-shaped curve, called the normal curve. The more closely the curve covers the histogram columns, the more nearly the data are normally distributed. Since the normal curve is supposed to cover 100% of the data, the area under the curve is equal to 1, or 100%. In every normal distribution:
- About 68% of the data falls within 1 standard deviation of the mean.
- About 95% of the data falls within 2 standard deviations of the mean.
- About 99.7% of the data falls within 3 standard deviations of the mean.
As a result, the continuous variable heart rate is approximately normally distributed; many continuous variables, such as height and weight, are approximately normally distributed. The normal curve depends on two measures, the mean and the standard deviation. The mean determines the center of the graph and the standard deviation determines its height and width: a smaller standard deviation produces a taller, narrower curve, and a larger standard deviation produces a shorter, wider curve. Therefore a normal distribution is described by two parameters, the mean and the standard deviation; if μ and σ are the mean and standard deviation, we write the distribution as N(μ, σ). When the data are not symmetric but skewed to the right or left, the curve reflects the skewness: it has one long tail instead of two symmetric tails. Recall the figures presented in the skewness section: figure 3 is almost symmetric and the majority of the data are close to the mean. When the data values are evenly distributed about the mean, the distribution is normal and symmetric, the three measures of central tendency are equal, and the graph is a bell curve.
When the mean is 0 and the standard deviation is 1, the normal distribution is called the standard normal distribution. As mentioned earlier, the normal distribution enables us to calculate the probability that a data value is less than a value, greater than a value, or between two values, by measuring the corresponding area under the normal curve. In practice, we standardize the distribution and use the standard normal probability table, since every normal variable X with mean μ and standard deviation σ can be converted to a variable Z with the standard normal distribution:
If X ~ N(μ, σ), then Z = (X − μ)/σ ~ N(0, 1).
Statistical software or a calculator lets us solve such problems instantly. For example, for the heartbeat data, the probability that a value is less than 75 (p(X < 75) = 0.489) indicates there is a 48.9% chance that a person's resting heart rate is below 75 beats per minute. To calculate the probability we can use the Free Statistics Calculators or SPSS. Suppose the heartbeat data are normally distributed with a mean of 74 and a standard deviation of 15, that is, N(74, 15). We wish to know:
- The probability that a person's heart rate is less than 65 (p(X < 65))
- The probability that a person's heart rate is between 64 and 68 (p(64 < X < 68))
- The probability that a person's heart rate is 77 or greater (p(X ≥ 77))
- The value corresponding to the 80th percentile of this distribution
The following indicates the procedures:
- Free Statistics Calculators → Normal Distribution → Cumulative Distribution Function (CDF) Calculator for the Normal Distribution (fill the boxes for example for question 1 (74, 15, 65)).
- SPSS procedure:
- Define three variables Mean, SD, and X in the variable view
- In data view enter the value of the variables in this example (74, 15, 65) for question 1.
- Transform → Compute Variable → in target variable box, choose a new name for example “Probability 1” → in Function Group, choose CDF & Noncentral CDF → in Function and Special Variables, choose Cdf. Normal → fill the Numeric Expression box with the values of X, mean, and standard deviation respectively → Ok
SPSS creates a new column named “Probability 1” with the value in data view.
Solution:
- p(X < 65) ≈ 0.27, which means there is a 27% chance that a person's heart rate is below 65.
- We use the same procedure to find p(X < 68) and p(X < 64), then subtract the values:
p(64 < X < 68) = p(X < 68) − p(X < 64) = 0.344 − 0.252 = 0.092 = 9.2%
- Since the procedure above calculates the probability of being less than a value, for the third question we first calculate p(X < 77) and then subtract it from 1, since the total area under the normal curve is 1: p(X ≥ 77) = 1 − p(X < 77) = 1 − 0.579 = 0.421 = 42.1%
- For the fourth question we are looking for the value below which 80 percent of the data fall: p(X < a) = 0.8 → a = 86.62, which means 80 percent of people have a resting heart rate below 86.62 beats per minute.
Since this question is the inverse of the previous ones, we use SPSS (inverse normal distribution) with the following procedure.
- Define three variables Mean, SD, and P in the variable view
- In data view, enter the value of the variables in this example (74, 15, 0.8) for question 4.
- Transform → Compute Variable → in target variable box, choose a new name for example “X” → in Function Group, choose Inverse DF → in Function and Special Variables, choose Idf. Normal → fill the Numeric Expression box with the values of P, mean, and standard deviation respectively → Ok
SPSS creates a new column named “X” with the value in data view.
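The four questions can also be answered directly in R with `pnorm()` and `qnorm()`, under the same N(74, 15) assumption; the results match the values above.

```r
pnorm(65, mean = 74, sd = 15)           # 1) p(X < 65), about 0.27
pnorm(68, 74, 15) - pnorm(64, 74, 15)   # 2) p(64 < X < 68), about 0.092
1 - pnorm(77, 74, 15)                   # 3) p(X >= 77), about 0.42
qnorm(0.80, 74, 15)                     # 4) 80th percentile, about 86.62
```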
As mentioned above, the probability of falling between two values a and b is the area under the curve between the two vertical lines that cross the X-axis at a and b. Therefore, the probability of any exact value is always 0, since a single vertical line has no area under the curve:
p(X = a) = 0 when X ~ N(μ, σ)
Many parametric statistical methods, explained later, require that the variables be approximately normally distributed.
The following numerical and visual outputs should be checked:
- Skewness and kurtosis z-values should be between −1.96 and +1.96
- The Shapiro-Wilk test p-value should be above 0.05
- Histogram, normal Q-Q plot, and box plot
To calculate the skewness or kurtosis z-value, divide the skewness or kurtosis value by its standard error; the result should be between −1.96 and +1.96.
SPSS:
Analyze → Descriptive Statistics → Frequencies (move the variable to the right) → Statistics (check skewness and Kurtosis) → Continue → Ok
Output:
| Statistics: Blood Glucose Level | |
| --- | --- |
| N (Valid) | 35 |
| N (Missing) | 0 |
| Skewness | .284 |
| Std. Error of Skewness | .398 |
| Kurtosis | -.478 |
| Std. Error of Kurtosis | .778 |
The z-values for skewness and kurtosis respectively are z_skewness = 0.284/0.398 = 0.714 and z_kurtosis = −0.478/0.778 = −0.614. Both values fall between −1.96 and +1.96, which suggests the data are normally distributed.
We use SPSS to run the normality test for the dataset from example 3:
SPSS:
Analyze → Descriptive Statistics → Explore → Plots (check Normality plots with tests, which produces the Shapiro-Wilk test, and Histogram)
Output:
| Tests of Normality | Kolmogorov-Smirnov (a): Statistic | df | Sig. | Shapiro-Wilk: Statistic | df | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| Blood Glucose Level | .077 | 35 | .200 (*) | .976 | 35 | .639 |

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
The P-value for the Shapiro-Wilk test (0.639 > 0.05) is not statistically significant, which means we do not reject the null hypothesis that the data are normally distributed (we will explain hypotheses and P-values later). The following graph is a normal Q-Q plot, which shows that most of the data points lie on or close to the line; in the ideal case, all the data points fall on the line.
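The same normality checks can be run in R; `glucose` below is a hypothetical stand-in for the example 3 variable.

```r
glucose <- c(88, 92, 79, 105, 84, 96, 71, 90, 101, 83)  # hypothetical values

shapiro.test(glucose)   # Shapiro-Wilk test: p > 0.05 gives no evidence against normality
qqnorm(glucose)         # normal Q-Q plot
qqline(glucose)         # reference line; points close to it support normality
hist(glucose)           # histogram as an additional visual check
```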
Types of Variable
Suppose a researcher wishes to identify a cause-and-effect relationship between two variables. The variable considered to be the cause is manipulated by the researcher to evaluate its effect on the other variable. The variable that is manipulated is called the independent variable, while the variable observed for any effect is called the dependent variable. The independent variable is sometimes called the explanatory, experimental, or predictor variable, or the factor, and the dependent variable is also called the response or outcome variable. For example, a researcher wishes to evaluate the relationship between weight and blood pressure; in this study, weight is the independent variable and blood pressure is the dependent variable. It is worth noting that an association between variables does not always mean a cause-and-effect relationship.
Observational research
In this type of research, the researcher merely observes and measures one or more variables of the individuals in order to analyze and interpret the data, or to look for possible relationships between the variables. The data might come from the past (a retrospective study), for example the birth weights of 20 twelve-year-old children, or from the present and future (a prospective study), where data collection might take many years, for example following selected variables in 20 newborn children until they turn 10 years old.
Experimental (interventional) research
Researchers conduct this type of study to investigate causal relationships between variables. They control the procedure of the research: manipulating one variable (the independent variable) to find any effect on another variable (the dependent variable), deciding which individuals are assigned to which group, controlling variables they do not want to interfere with the research, and deciding which treatment is assigned to which group. In such a controlled study, the researcher divides the sample into two (or more) groups and assigns a specific treatment to one (or more) of them, called the treatment group(s), while one group, called the control group, receives a placebo (a fake treatment that reduces emotional effects), the standard treatment, or sometimes neither. If the study assigns individuals and treatments to groups randomly, which is essential for investigating cause and effect, it is called a randomized experiment. Sometimes, to reduce bias and the effect of confounding variables, the researcher divides the sample so that each group has an equal number of individuals with certain characteristics. For example, suppose we wish to study the efficacy of a new drug for hypertension and 50 volunteer patients are to be assigned to two groups of 25. Suppose further that 20 of the patients are much older than the others. In this case we first divide the patients into two blocks, one block of the 20 older patients and another block of the remaining 30, and then randomly assign half of each block to each group, so that each group has 10 older and 15 younger patients. This method, called blocking, reduces bias and the effect of variables that might influence the dependent variable because they are correlated with both the dependent and independent variables (confounding variables). Blocking enables the researcher to reduce the variability between the groups and obtain similar groups. A study is called blinded when the researchers do not tell the patients which treatment they are receiving; if neither the patients nor the doctors know about the treatments and groups, it is double blinded. Researchers attempt to reduce bias in order to obtain accurate results and interpretations. To do so, they apply randomization, control, replication (a sufficiently large sample size), blocking, and blinding or double blinding in the experiment.
Point Estimates
As mentioned earlier, there are measures of central tendency and variability for both populations and samples. These measures for the population, such as the population mean or population variance, are parameters, and their values are constant since there is only one population, although we are unlikely to know their exact values because we do not study the entire population. The statistics, however, are characteristics of the sample, such as the sample mean. The value of a statistic differs from one sample to another, since we can draw different samples from the same population; each sample generates, for instance, a sample mean that varies from sample to sample. Therefore statistics are random variables, since they come from randomly selected samples. The following diagram shows a population and four random samples with their corresponding means and variances. It indicates that there is one population mean (μ) and one population variance (σ²), and that we are interested in estimating these parameters when they are unknown. The diagram also shows that there are four (or more) possible random samples, each generating a mean and a variance whose values differ from one sample to another. We select one appropriate sample that properly represents the population and use its mean and variance to estimate the mean and variance of the population.
The sample mean and variance are the point estimates of the population mean and variance. They are not exactly equal to the true values of the population parameters; however, as the sample size increases, the point estimates get closer to the population parameters.
The Central Limit Theorem (CLT) and Sampling Distribution
Suppose we take five random samples of a given size from a population with parameters (μ, σ), the mean and standard deviation of the population. We then have five sample means, and if we generate their histogram we can see it is more symmetric than the population histogram, and the mean of these five sample means is almost equal to the population mean. If we increase the number of samples from 5 to 20, or even 100, we can see that the distribution of the sample means, called the sampling distribution, becomes more symmetrical around the population mean. The sampling distribution becomes even more symmetrical when we increase the sample size. Therefore:
If all possible samples randomly taken from a population with parameters (μ, σ) are sufficiently large (at least 30), then the sampling distribution of the sample means is normal or nearly normal regardless of the population distribution. In addition to being large enough, the samples must be independent; to meet the independence assumption, the sample size should be less than 10% of the population size. The mean of the sample means is equal to the population mean, and the standard deviation of the sample means, called the standard error and denoted by SE or SE(X̄), is equal to the population standard deviation divided by the square root of the sample size. That means:
μ_X̄ = μ and SE = σ_X̄ = σ/√n, so X̄ ~ N(μ, σ/√n) or Z = (X̄ − μ)/(σ/√n) ~ N(0, 1)
As the formulas above indicate, the mean of the sample means equals the population mean, whereas the standard deviation of the sample means (the standard error) is 1/√n times the population standard deviation; that is, the dispersion of the sample means is less than the population dispersion. If the population standard deviation (σ) is unknown, its point estimate (s) from a sample of at least 30 observations can be used. The standard error indicates the variability of the sample means, or the error of the point estimate, since it reflects the distance between the true parameter and the point estimate.
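A quick simulation makes the theorem concrete. In the R sketch below, many samples are drawn from a clearly non-normal (exponential) population, yet the sample means are centred on μ with a spread close to σ/√n; the population and the sample size are arbitrary choices for illustration.

```r
set.seed(1)
mu <- 10                 # an exponential population with mean 10 also has sd 10
n  <- 36                 # sample size of at least 30

sample_means <- replicate(5000, mean(rexp(n, rate = 1 / mu)))

mean(sample_means)       # close to mu = 10
sd(sample_means)         # close to sigma / sqrt(n) = 10/6, about 1.67
hist(sample_means)       # approximately bell-shaped despite the skewed population
```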
Interval Estimates
As mentioned earlier, a point estimate from a sample is not a completely reliable estimate of a population parameter, since it depends on the sample and the sample size. The point estimate is a single value that might not even be close to the true parameter; however, we can provide a range of values that is more likely to include the parameter. For example, suppose you wish to meet a friend after work around 7 pm. You finish at 6:30 and it usually takes you 30 minutes to get to the meeting place. How likely are you to make it if you set the time at 7 sharp, as opposed to a window from 6:55 to 7:05? The window means you would meet your friend after 6:55 and before 7:05, and this range gives you a better chance of being on time whatever happens. Suppose you wish to estimate the average duration of the trip, so you randomly record 50 different durations as your sample. The average of these numbers is the sample mean, or point estimate, and different samples will generate different point estimates. If the duration is almost always greater than 27 minutes and less than 35 minutes, then the range (6:57, 7:05) is an interval estimate. For the sake of accuracy, an interval estimate is used, since we do not know how confident we can be that the point estimate is close to the parameter. Population parameters are estimated by point estimates, which are specific numerical values, and the probability that a point estimate exactly equals the parameter is essentially 0. However, we can find an interval that may or may not include the population parameter, called an interval estimate. For the example above, suppose you select 20 different samples, and each sample provides one point estimate; around each point estimate you build an interval estimate. Suppose 19 of these interval estimates capture the true mean duration; then you are 95% confident that you will meet your friend within a range such as (6:58, 7:01). An interval estimate stated with a level of confidence is called a confidence interval. The percentage of interval estimates that contain the parameter is called the confidence level, denoted by C. If you wish to be more confident that the interval captures the parameter, you use a larger confidence level (C); the interval then becomes wider, so a larger confidence level generates a wider confidence interval.
Confidence interval
A confidence interval is used to estimate an unknown parameter; it is a kind of interval estimate, but it is built from a sample and a specified confidence level, around the point estimate. The confidence interval for any confidence level is:
Point estimate ± z_(α/2)·SE, where α = 1 − C, SE is the standard error, and z_(α/2)·SE is called the margin of error, denoted by E.
Confidence interval of the mean for a specific α when σ is known
x̄ − z_(α/2)(σ/√n) < μ < x̄ + z_(α/2)(σ/√n)
where x̄ is the sample mean, σ is the population standard deviation, n is the sample size, α = 1 − C, and z_(α/2) is the critical value that cuts off a total area of α in the two tails of the standard normal distribution curve.
For example, for a 95% confidence interval we can be 95% confident that the value of the population mean lies within this interval, provided the variable is normally distributed in the population. The value of z_(α/2) can be found in any standard normal table or with software.
Since we can’t use SPSS we use Free Statistics Calculator for this case.
Confidence interval→ Confidence Interval Calculator for the Population Mean (when Population std dev is known) or http://www.socscistatistics.com/confidenceinterval/Default3.aspx.
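The interval also follows directly from the formula in R. Everything in this sketch is hypothetical: a sample mean of 90, a known population standard deviation of 9, a sample of 35, and a 95% confidence level.

```r
xbar  <- 90      # hypothetical sample mean
sigma <- 9       # assumed known population standard deviation
n     <- 35
conf  <- 0.95

alpha <- 1 - conf
z     <- qnorm(1 - alpha / 2)    # z_(alpha/2), about 1.96 for 95%
se    <- sigma / sqrt(n)

c(lower = xbar - z * se, upper = xbar + z * se)   # confidence interval for mu
```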
Margin of error or Maximum error for estimate
The margin of error, E, is the maximum difference between the actual population parameter and the sample estimate of the parameter, and has the formula:
E = z_(α/2)(σ/√n)
Assumptions for finding a confidence interval for the mean when σ is known
- The sample is a random sample
- Either n≥30 or the population is normally distributed if n<30.
Sample Size
Since we are not able to study the entire population, we study a sample from it, but the question is: what is the best sample size? It is not an easy question, but we know it depends on three things:
- The margin of error
- The population standard deviation
- Degree of confidence
We can derive the following formula from the margin of error formula:
n = (z_(α/2)·σ/E)²
Most of the time we don’t know about the population standard deviation (ơ) so we can estimate it from:
- Previous studies so we can use standard deviation (s) using the sample.
- We can conduct a pilot study as a small scale preliminary study to find the sample standard deviation from this small study.
- Use the common guess which is dividing the range (high data – low data) by 4.
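The sample size formula translates directly into R; the margin of error, confidence level, and standard deviation guess below are hypothetical.

```r
E     <- 2       # desired margin of error (hypothetical)
sigma <- 9       # guessed population standard deviation, e.g. range / 4
conf  <- 0.95

z <- qnorm(1 - (1 - conf) / 2)
n <- (z * sigma / E)^2

ceiling(n)       # round up to the next whole subject
```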
The formula given above for the confidence interval for the population mean is used when the population standard deviation is known. When it is unknown, its point estimate, the sample standard deviation (s), is used instead, as follows.
t distribution
According to the central limit theorem, X̄ is normally distributed with parameters μ and σ/√n, or equivalently we can standardize to obtain Z = (X̄ − μ)/(σ/√n) ~ N(0, 1), under the assumptions of a sufficiently large sample size and independence. However, we cannot use this Z-score when the population standard deviation (σ) is unknown and the sample size is less than 30. In that case we use the point estimate of the population standard deviation (s), giving a new statistic called t, or the t-score, t = (X̄ − μ)/(s/√n), with a new distribution called the t distribution (or Student's t distribution). The t distribution is therefore the sampling distribution of the t statistic. The population should be normal or approximately normal; if it is not, the sample size should be greater than 30. The following are some similarities and differences between the t distribution and the standard normal distribution.
Similarities:
- They are bell shaped.
- They are symmetric about the mean.
- Measures of central tendency for both are equal to 0.
Differences:
- In the t distribution, the variance is greater than 1.
- The t distribution depends on the degrees of freedom, which are related to the sample size; as the sample size increases, the t distribution approaches the standard normal distribution.
Thus, the formula for the confidence interval for the Mean when population standard deviation is unknown is:
x̄ − t_(α/2)(s/√n) < μ < x̄ + t_(α/2)(s/√n)
where the degrees of freedom are d.f. = n − 1.
The assumptions for this formula are the same as for the previous one.
SPSS:
Analyze→ Descriptive Statistics→ Explore→ Statistics
For the dataset from example 3, the confidence interval for the population mean using the procedure above is:
Output:
| Descriptives: Blood Glucose Level | Statistic | Std. Error |
| --- | --- | --- |
| Mean | 90.2594 | 1.57735 |
| 95% Confidence Interval for Mean, Lower Bound | 87.0538 | |
| 95% Confidence Interval for Mean, Upper Bound | 93.4649 | |
| 5% Trimmed Mean | 90.1813 | |
| Median | 90.0975 | |
| Variance | 87.081 | |
| Std. Deviation | 9.33174 | |
| Minimum | 71.97 | |
| Maximum | 108.95 | |
| Range | 36.98 | |
| Interquartile Range | 11.82 | |
| Skewness | .284 | .398 |
| Kurtosis | -.478 | .778 |
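As a cross-check of the SPSS output, the 95% interval can be rebuilt in R from the summary values in the table (mean 90.2594, standard error 1.57735, n = 35), using the t distribution with n − 1 degrees of freedom:

```r
xbar <- 90.2594
se   <- 1.57735
n    <- 35

t_crit <- qt(0.975, df = n - 1)    # t_(alpha/2) with 34 degrees of freedom

c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)   # about (87.05, 93.46)
```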
Confidence interval and Sample Size for Proportion
A proportion refers to the fraction, decimal or percentage of the population that has a certain attribute, for example the proportion of male patients or the proportion of people who have heart disease. Since it is a characteristic of the population, it is a population parameter, and we may wish to estimate it with the sample proportion.
These symbols are used:
P = population proportion
p̂ = sample proportion
p̂ = x/n, q̂ = (n − x)/n = 1 − p̂
where x is the number of individuals in the sample that have the attribute and n is the sample size.
In this case the margin of error is:
E = z_(α/2)·√(p̂q̂/n)
Formula for a Specific Confidence Interval for a Proportion
p̂ − z_(α/2)√(p̂q̂/n) < P < p̂ + z_(α/2)√(p̂q̂/n)
when np̂ ≥ 5 and nq̂ ≥ 5
Sample size for Proportion
n = p̂q̂(z_(α/2)/E)²
If we have p̂ from a previous study, we use it in the sample size formula; otherwise we use 0.5.
Unfortunately, we can't use SPSS for the confidence interval for a proportion, hence we use R:
> prop.test(x = ….., n = ……, conf.level = …..)
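For instance, a hypothetical run with 12 of 40 sampled patients having the attribute, at a 95% confidence level, would look like the sketch below; the counts are made up purely for illustration. Note that `prop.test()` reports a Wilson-type interval, which can differ slightly from the hand formula above.

```r
# Hypothetical counts: 12 'successes' out of 40 sampled patients
prop.test(x = 12, n = 40, conf.level = 0.95)   # output includes the CI for the proportion
```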
Confidence Interval for Variances and Standard Deviations
As already mentioned, in addition to the population mean (a measure of central tendency), we might be interested in estimating the population variance or standard deviation (measures of variability). For example, a doctor may wish to know about the variation of people's heart rates rather than their average. As mentioned earlier, the point estimate of the population variance (σ²) is the sample variance (s²), and likewise the sample standard deviation (s) is the point estimate of the population standard deviation (σ). To calculate a confidence interval for the variance or standard deviation, we need the chi-square distribution. Suppose a sample of size n is selected from a normally distributed population with parameters (μ, σ²); then the chi-square statistic, denoted by χ² and defined by the following equation, has a chi-square distribution with n − 1 degrees of freedom. This distribution is a family of curves, and as the degrees of freedom increase the distribution becomes more symmetrical and approaches a normal distribution.
χ² = (n − 1)s²/σ² ~ χ²(n − 1)
This variable cannot be negative, and its distribution is not symmetric; it is skewed to the right.
Formula for the Confidence Interval for the Variance and Standard Deviation:
(n − 1)s²/χ²_right < σ² < (n − 1)s²/χ²_left, with d.f. = n − 1
or
√[(n − 1)s²/χ²_right] < σ < √[(n − 1)s²/χ²_left]
We use α/2 to find χ²_right and 1 − α/2 to find χ²_left.
Unfortunately, there is no direct SPSS procedure or single built-in R function for this confidence interval, but you can use the Free Statistics Calculator to find χ²_right and χ²_left. The following is the procedure:
Chi-Square Distribution → Critical Chi-Square Value Calculator (Degrees of freedom = n − 1, Probability Level = α/2 for χ²_right and 1 − α/2 for χ²_left).
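Although there is no single built-in function, the interval is easy to assemble in R with `qchisq()`. The sketch below plugs in the example 3 summary values from the Descriptives table (s² = 87.081, n = 35) at a 95% confidence level.

```r
s2    <- 87.081   # sample variance from the example 3 output
n     <- 35
alpha <- 0.05

chi_right <- qchisq(1 - alpha / 2, df = n - 1)   # upper (right-tail) critical value
chi_left  <- qchisq(alpha / 2,     df = n - 1)   # lower (left-tail) critical value

var_ci <- c((n - 1) * s2 / chi_right, (n - 1) * s2 / chi_left)   # CI for sigma^2
sd_ci  <- sqrt(var_ci)                                           # CI for sigma

var_ci
sd_ci
```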
The term Statistical Significance is complex. In statistics, a black-and-white (yes or no) answer to the research question is extremely difficult to obtain. Most of the time, statistics provide a level of significance reflected as a P-value, which is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true. If the experiment is designed for a confidence level of 95%, results with a P-value ≤ 0.05 are considered significant; for 99%, P ≤ 0.01; for 99.9%, P ≤ 0.001. The P-value is not an indicator of the size or importance of the observed effect. In simple words, P ≤ 0.05 indicates that if the null hypothesis were true and the experiment were repeated 100 times, a result at least this extreme would be expected fewer than 5 times.
Numerical descriptors include the mean and standard deviation for continuous data, as well as the frequency and percentage for categorical data. The normal distribution is a bell-shaped curve with equal distribution on both the left and right sides of the mean. The normal distribution is a mathematical assumption from which probabilities based on repeated measurements of specific variables can be calculated. For example, it can tell us how many students did extremely well or extremely poorly on an examination in relation to the mean, median or mode formed by the majority of the students' scores. Although researchers often assume that their data follow a normal distribution with a bell-shaped curve, this is not always the case. Therefore, it is important to look for the following abnormalities in the distribution of the data.
First, a distribution can be skewed to the left or right. When the mean exceeds the median, the curve is positively skewed to the right; in our example, this means that more students who took the examination scored below the mean. When the mean is less than the median, the curve is negatively skewed to the left; in our example, this indicates that more students scored above the mean.
The shape of the bell curve can also differ in another way: it can be flat or slim with a high peak. This is described by kurtosis. Flat curves are called platykurtic and tall, slim curves are called leptokurtic, in relation to mesokurtic, which describes a normal distribution. The curve appears flat when there is a large amount of variability in the measurements and the standard deviation is large; slim, peaked curves occur when there is little variability and the standard deviation is very small. The area under the normal curve on both sides of the mean is described in units of standard deviation, also referred to as the Z-score. In the standard normal distribution, the mean has a Z-score of 0 and the standard deviation is 1.0. On each side of the mean, the area under the bell is 34.13% within 1 standard deviation, 13.59% between 1 and 2 standard deviations, and 2.15% between 2 and 3 standard deviations. Calculating these areas with a Z-score provides an indication of the normality of the curve. Z-scores are also used to calculate sample sizes for experiments.
Bell curves are also used in education to determine percentile scores of students. If it is assumed that the mean is 50%, the distribution on both sides of the mean will indicate where a particular student is in relation to the mean score of the other students. In educational circles, normal distribution is also used to obtain other scores such as T scores.
A scatter plot (scatter graph) is another way to assess the distribution of experimental data: individual observations are plotted as points in the area between the X and Y axes.
Statistical analysis of experimental data is mostly about the relationship between the dependent variables acquired as a result of the effect of the independent variables. The cause-effect relationship is established not only by observing the effects in a treatment group, but also by comparing the effects with an identical group in which all of the conditions of the experiment are met without the treatment (the control group). The second step of data analysis is therefore called inferential statistics. Inferential statistics is used to draw meaningful conclusions about the entire population. These inferences may include a simple yes/no answer to a scientific question (hypothesis testing), estimates of numerical characteristics of the data, descriptions of relationships within the data, and forecasting and prediction.
The most common inferential statistical analysis for the comparison of two independent groups is the unpaired t-test (Student's t-test). If the subject of investigation is only one group but measurements were taken before and after the treatment, the analysis used is the paired t-test. If the treatment is applied to several (two or more) independent groups, the required analysis is ANOVA (Analysis of Variance). In other words, ANOVA is a statistical test that shows whether the means of several groups are equal or whether there are differences among the group means. It is like a t-test for more than two groups, and it partitions the variation within and between the groups.
Single-factor ANOVA, also called one-way ANOVA, studies the effect of one factor across different groups. Multi-factor or two-way ANOVA analyzes the effect of multiple factors on several groups. When the experiment includes observations at combinations of the levels of the factors, the design is called factorial.
When the effects of treatment are observed in one group but measured repeatedly at several intervals, the appropriate analysis is a repeated-measures ANOVA.
In complex designs, when the effect of several treatments is observed in several groups or over several time intervals and the researcher wishes to identify cross-factor effects among all treatments and groups, ANOVA with orthogonal contrasts is applied.
Correlation is a dependence between two random variables or two sets of data. Correlation can indicate a predictive relationship between variables, but it is not, by itself, related to causality. In practice the degree of correlation is important and is calculated with Pearson's correlation coefficient, which quantifies the strength of the relationship between the variables.
Regression analysis is a statistical process for estimating the relationship between dependent and independent variables. It describes how the value of the dependent variable changes when one independent variable is varied while the others are held fixed. It may be used to predict or forecast outcomes, but one must be extremely careful before expressing this type of relationship as a cause-effect relationship. The main difference between correlation and regression is that correlation is about the relationship between two variables (X and Y) and it does not matter which is which, whereas regression is about the effect of an independent variable on a dependent variable, a one-way street, so it does matter which variable is X and which is Y. Correlation cannot speak to causality, while regression is directional and is the tool used when cause-and-effect relationships are being modelled.
At times, when the distribution of the data is skewed, the number of experimental units is small, or the effect of the treatment is measured by categorical rather than numerical data, nonparametric statistics are applied.
Nonparametric statistics do not assume that the data follow a probability distribution with preconceived parameters, which is why nonparametric tests make no assumptions about probability distributions. Any quantities used in the calculations are not specified in advance but are generated from the data. Some useful nonparametric statistical analyses are listed below:
Histograms – Show a nonparametric probability distribution in a graph; the histogram was introduced by Karl Pearson.
Mann-Whitney U-test – Compares two samples to assess whether they come from the same population, that is, whether one population tends to have larger values than the other. It is essentially the nonparametric counterpart of the t-test; the median is used instead of the mean.
Nonparametric regression
Cohen's kappa – Measures inter-rater agreement for categorical items.
Friedman's test – A nonparametric version of the two-way analysis of variance. It tests whether k treatments in a randomized block design have identical effects.
Cochran's Q test – Tests whether k treatments in a randomized block design with binary (0, 1) outcomes have identical effects. It is used for the analysis of two-way randomized block designs where the response variable can take only two possible outcomes.
Kruskal-Wallis test – A nonparametric one-way ANOVA, testing whether two or more independent samples are drawn from the same distribution. It is an extension of the Mann-Whitney U-test to more than two groups.
Kendall's W test – Used for assessing agreement among raters, much like Pearson's correlation coefficient, but it makes no distributional assumptions and can handle any number of distinct outcomes.
Hypothesis Testing
Hypothesis testing is one of the most important and useful subjects in inferential statistics. A statistical hypothesis is a premise or claim that a researcher wishes to test, usually about a population parameter. As mentioned earlier, a researcher conducts an observational or experimental study to evaluate a hypothesis about a population parameter or to compare the parameters of two or more groups. Sometimes the hypothesis is an assumption about the population parameter obtained from a previous study, or a claim whose validity the researcher wishes to check. The hypotheses the researcher is interested in testing are called statistical hypotheses, and statisticians use hypothesis testing to make a decision about the assumption or claim. In every statistical hypothesis test there are two mutually exclusive hypotheses, the null hypothesis and the alternative hypothesis, and we always either reject or fail to reject the null hypothesis based on the evidence (the data). The evidence that enables us to make a decision about the null hypothesis is the data collected from the population as a sample. In a hypothesis test the researcher must carry out the following steps:
- Define the population under study
- State the hypotheses that will be studied
- Choose the significance level
- Select the sample from the population
- Collect the data
- Compute the test statistic
- Reach a conclusion
- Make the decision
The Null Hypothesis
It is denoted by H0 and is usually the default hypothesis that is currently accepted; in other words, it expresses no difference between parameters, a skeptical claim, or the conventional position.
The Alternative Hypothesis
It is denoted by H1 or HA and expresses a difference between parameters, a change, or a new claim.
We either reject the null hypothesis or fail to reject it; that is, the decision is always stated in terms of the null hypothesis. The null hypothesis will not be rejected unless there is sufficiently strong evidence in favor of the alternative hypothesis. However, rejecting the null hypothesis does not prove that the alternative hypothesis is true, since the decision is based on a sample rather than the entire population. Thus, there are two possible outcomes:
- Reject the null hypothesis
- Fail to reject the null hypothesis
The claim can be stated as either the null hypothesis or the alternative hypothesis. To express the decision, if the claim is the null hypothesis we say there is or is not enough evidence to reject the claim, and if the claim is the alternative hypothesis we say there is or is not enough evidence to support the claim. For example, a researcher wishes to evaluate the efficacy of a new drug to lower blood pressure. Suppose he knows from a previous study that the average blood pressure of the patients under study, who have hypertension, is usually 140/90 mm Hg. The claim is that the average systolic blood pressure is less than 140 after taking the new drug. Therefore, the hypotheses are:
H0: μ = 140 vs. H1: μ < 140
The null hypothesis states that the mean will remain unchanged and there is no efficacy, and the alternative hypothesis states that the new drug lowers the blood pressure. This test is called a one-tailed or left-tailed test.
If the researcher is interested only in whether the mean changes, or in whether it increases, the hypotheses are respectively:
H0: μ = 140 vs. H1: μ ≠ 140 (two-tailed test)
H0: μ = 140 vs. H1: μ > 140 (one-tailed or right-tailed test)
The ideal way to decide between the null and alternative hypotheses would be to examine the whole population. Since this is almost impossible, we make the decision based on a random sample from the population. We might therefore make a mistake, since the sample may not adequately represent the population.
Decision Errors
Two types of error can occur after the decision is made:
- Type I error: rejecting the null hypothesis when it is true.
- Type II error: failing to reject the null hypothesis when it is false.
There are four possible outcomes after making the decision that is shown in the following table.
| | H0 true | H0 false |
| --- | --- | --- |
| Reject H0 | Type I error | Correct decision |
| Fail to reject H0 | Correct decision | Type II error |
Significance level
The probability of committing a Type I error is called the significance level and is denoted by α.
Three values of α are commonly used: 0.01, 0.05 and 0.1, which mean that if the null hypothesis is rejected, the probability of having committed a Type I error is 1%, 5% or 10% respectively.
We already mentioned the confidence level, denoted by C, which expresses how confident we are in our decision: C = 1 − α.
Power of the test and Beta
The probability of committing a Type II error is called beta and is denoted by β. The probability of rejecting the null hypothesis when it is actually false, that is, of making the correct decision, is the power of the test. Mathematically, the power of the test is 1 − β, and by convention a value of 0.8 or greater is considered adequate, with 0.8 as the standard. The values of α, β, and the statistical power of the test are always between 0 and 1, since they are probabilities.
Effect size
The effect size is the standardized difference between the true value and the value stated by the null hypothesis when the null hypothesis is false. In other words, the effect size is a measure of the strength of an effect (by convention, Cohen's d values of 0.2, 0.5, and 0.8 are considered small, medium, and large, respectively). The effect size is computed when the null hypothesis is rejected; otherwise it is not meaningful. We can use the Free Statistics Calculator or http://www.socscistatistics.com/effectsize/Default3.aspx to calculate the effect size.
Given the power of the test, the effect size, and the significance level, we can use the Free Statistics Calculator to calculate the minimum sample size.
The test statistic is computed from the sample data and is used to make the decision about the claim; its computed value is called the test value.
The critical value (C.V.) separates the critical region from the noncritical region.
The critical region, or region of rejection, is the range of values for which the null hypothesis is rejected if the test value falls within it.
The noncritical region, or region of acceptance, is the range of values for which the null hypothesis is not rejected if the test value falls within it.
The P-value is the probability, assuming the null hypothesis is true, of obtaining a sample result at least as extreme as the one observed, in the direction of the alternative hypothesis.
As mentioned above, after collecting the data (our evidence), several methods can be used to make the decision. One is the critical-region method: if the sample mean falls in the critical region, the null hypothesis is rejected; otherwise we fail to reject it. Another is the confidence-interval method: we compute the confidence interval, and if the population mean stated by the null hypothesis falls within that interval, we do not reject the null hypothesis. Finally, there is the P-value method, which is the most common, since almost all statistical software computes it.
Suppose a previous study found the average weight of newborn babies to be 7.5 pounds (the population mean μ), with a standard deviation of 1.3 pounds (σ). A researcher claims this average is no longer valid because of new diet methods, nutrients, and vitamins taken during pregnancy, so he tries to collect enough evidence to support his claim. The hypotheses are:
H0: μ = 7.5 vs. H1: μ > 7.5 (right-tailed test)
The researcher randomly selected 100 newborns and recorded their weights. The mean of this sample is 7.76 pounds (x̄ = 7.76), which is greater than 7.5; however, he does not yet know whether it is statistically large enough to reject the null hypothesis and support his claim. The following histogram represents the distribution of the sample.
As the histogram above shows, the data are left skewed; however, the sample is sufficiently large (n = 100) and randomly selected, so the assumptions of the central limit theorem are met. According to the central limit theorem, the distribution of the sample mean is approximately normal with a mean of 7.5 pounds and a standard error of 1.3/√100 = 0.13. Suppose he tests the hypotheses at a significance level (α) of 0.05. In summary:
H0: μ = 7.5 vs. H1: μ > 7.5 (right-tailed test)
σ = 1.3, n = 100, x̄ = 7.76, μ0 = 7.5 (hypothesized population mean), X̄ ~ N(7.5, 0.13), α = 0.05
The critical value (C.V.) in this test is 7.71, which means that a sample mean greater than this value lies in the critical region and causes the null hypothesis to be rejected. In other words, we do not reject the null hypothesis unless the sample mean is greater than 7.71. (The following procedure in Excel computes the value of the C.V.)
Type in one cell of spreadsheet: =NORM.INV(probability, mean, standard_dev) → enter
Probability in this command is the area to the left of the critical value under the normal curve. If the test is right-tailed, the critical value and the corresponding region are on the right side of the mean, and we use 1 − α for the probability. For a left-tailed test, the C.V. and corresponding region are on the left side and we use α. For a two-tailed test, the critical region is split in half, one part on each side of the mean; we use 1 − α/2 for the right side and α/2 for the left side.
In this example, =NORM.INV(0.95, 7.5, 0.13) → enter, which gives us 7.713831 ≈ 7.71.
The following normal curve of the sample mean, drawn assuming the null hypothesis is true, shows the critical value to the right of the mean (7.5); the red area represents the critical region.
Critical value = 7.71
As you can see in this example, the sample mean x̄ = 7.76 is greater than C.V. = 7.71 and falls in the critical region. Therefore the decision is to reject the null hypothesis: the evidence is strong enough to support the claim.
The P-Value Method:
The yellow area (the right tail) beyond the sample mean (x̄ = 7.76) in the figure above is the P-value, computed assuming the null hypothesis is true, in the direction of the alternative hypothesis. The P-value is the probability of obtaining a sample mean of 7.76 or greater (since H1: μ > 7.5) if the null hypothesis is true. For this example the P-value is 0.0228 (=1−NORM.DIST(7.76, 7.5, 0.13, TRUE)). As the figure shows, a sample mean greater than 7.76 would give a smaller P-value, meaning stronger evidence against the null hypothesis; conversely, a smaller sample mean, closer to the population mean of 7.5, gives a larger area and indicates that the evidence is not strong enough to reject the null hypothesis. Excel computes the P-value with the following command:
Type in a cell of the spreadsheet: =NORM.DIST(x, mean, standard_dev,True) → enter
The procedure above computes the area to the left of the sample mean (7.76). If the test is right-tailed, as in this example, the value should be subtracted from 1 to obtain the P-value. If the test is left-tailed, the value itself is the P-value. For a two-tailed test, the tail area should be doubled to obtain the P-value.
We can make the decision based on the critical region, the P-value, or the confidence interval. Since we use software in this tutorial, we use the P-value method.
If the P-Value is less than α the null hypothesis will be rejected and if the P-Value is greater than α we do not reject the null hypothesis.
P-Value ≤ α→ Reject the null hypothesis
P-Value >α →Do not reject the null hypothesis
In this example the P-value = 0.0228 < 0.05, so we reject the null hypothesis: the evidence is strong enough to support the claim that the average weight of newborns has increased. We might compare the P-value with a significance level of 0.01 to be more strict; in that case, since 0.0228 > 0.01, we would fail to reject the null hypothesis, meaning the evidence is not strong enough to support the claim.
Suppose the true value of the population mean (μ) is greater than 7.5, for example 7.76. As mentioned earlier, the probability of failing to reject the null hypothesis when it is false is the Type II error rate (β). In this example, β = p(X̄ < 7.71 | μ = 7.76), which is the area in the noncritical region under the normal curve with mean μ = 7.76.
The following procedure calculates β in Excel:
Type in one cell of spreadsheet: =NORM.DIST(7.71, 7.76, 0.13, True) → enter
In this example β = 0.35, so the statistical power of the test is 1 − 0.35 = 0.65.
The following figure indicates the type II error as well as type I error and the statistical power of the test with the critical value of 7.71.
As the figure above indicates, and as mentioned earlier, the sum of the power and the probability of committing a Type II error (β) is 1: when the statistical power of the test increases, the probability of committing a Type II error decreases, and vice versa. The boundary between a Type I error and a Type II error is the critical value (C.V.); when the probability of committing a Type I error (α) decreases, β increases. As you can see, decision errors (α, β) are inevitable, but we must be aware of the consequences of making each error. The Type I and Type II errors are also referred to as false positive and false negative results respectively, as they are known in medical statistics. For example, in a pregnancy test, if a woman who is not pregnant gets a positive result, a Type I error has occurred; if a pregnant woman gets a negative result, a Type II error has occurred. For medical tests we might prefer to decrease the probability of committing a Type II error rather than a Type I error; however, in some hypothesis tests rejecting the null hypothesis incorrectly (a Type I error) may cost more than committing a Type II error.
Finally, the effect size is a quantity we wish to calculate in any hypothesis test, since the greater the effect size, the greater the statistical power of the test. In this example the effect size is 0.2, which is small.
Effect size (Cohen's d) = (sample mean − population mean)/population standard deviation = (7.76 − 7.5)/1.3 = 0.2
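The whole worked example can be reproduced in R with the normal distribution functions; the numbers below match the Excel calculations above (the rounded critical value 7.71 is reused for β).

```r
mu0 <- 7.5; sigma <- 1.3; n <- 100; xbar <- 7.76; alpha <- 0.05
se  <- sigma / sqrt(n)               # 0.13

cv    <- qnorm(1 - alpha, mu0, se)   # critical value, about 7.71 (right-tailed test)
p_val <- 1 - pnorm(xbar, mu0, se)    # P-value, about 0.0228
beta  <- pnorm(7.71, 7.76, se)       # Type II error if the true mean is 7.76, about 0.35
power <- 1 - beta                    # about 0.65
d     <- (xbar - mu0) / sigma        # Cohen's d, 0.2

c(cv = cv, p = p_val, beta = beta, power = power, d = d)
```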
G*Power is free software that can be used to calculate the power of the test and the effect size for most statistical tests.
Parametric Tests
A parametric test is a statistical test whose fundamental assumption is that the population follows a normal distribution, or sometimes another specific probability distribution. For example, the t test, Z test and F test are used to test population parameters such as the mean, variance and proportion. Below we explain the most common tests and how to use SPSS or Excel to run them.
Z Test for a Mean
If we wish to test the mean of a population, we have two statistical tests, depending on whether the population variance is known. If the variance is known, we use the one-sample Z test; otherwise we use the one-sample t test. The null hypothesis states a hypothesized value for the population mean, whereas the alternative hypothesis states a larger, smaller, or simply different value. The procedure starts with selecting the sample and obtaining the required information, in other words the evidence, from the data. In general, the test value is:
Test value = (observed value − expected or hypothesized value)/standard error
The observed value is computed from the sample data (for example the sample mean), while the expected value is the parameter, such as the population mean, if the null hypothesis is true.
The hypotheses and the test value for the Z test are therefore:
H0: μ = μ0 vs. H1: μ ≠ μ0 (for a one-tailed test, μ < μ0 or μ > μ0 is used) and z = (x̄ − μ0)/(σ/√n), where:
x̄ is the sample mean, μ0 is the population mean if the null hypothesis is true, σ is the population standard deviation, and n is the sample size.
Assumptions for the Z Test for a Mean when σ is known
- The variable measured (dependent variable) should be quantitative.
- The sample is a random sample, with no relationship between the observations.
- Either n≥30 or the population is normally distributed if n<30 with no outliers.
Example 4: Recall the example mentioned earlier about the efficacy of a new drug for lowering blood pressure. Suppose 35 patients with hypertension volunteer for this study and are randomly selected. If the sample mean systolic pressure is 138 mm Hg and the population standard deviation based on the previous study is 20 mm Hg, is the new drug significantly effective? (α = 0.05)
The hypotheses are: H0: μ = 140 vs. H1: μ < 140 (claim)
Confidence interval approach:
The sample mean, x̄ = 138, is the point estimate for the population mean and, as we already discussed, there is uncertainty about a point estimate due to sampling variation. Therefore, we compute a confidence interval for the population mean based on the sample. With the help of Free Statistics Calculators (Confidence Interval → Confidence Interval Calculator for the Population Mean (when the population standard deviation is known)) or http://www.socscistatistics.com/confidenceinterval/Default3.aspx, the confidence interval for μ is (131.37, 144.63). Since μ = 140 from the previous study falls in this range, we do not have enough evidence to support the claim; in other words, we fail to reject the null hypothesis.
The p-value method:
We calculate the p-value to compare with the significance level. Since σ is known, we cannot use SPSS; however, we can use http://www.socscistatistics.com/tests/ztest_sample_mean/Default2.aspx, which calculates the z score and the p-value. The p-value calculated by this tool is 0.2776, which is greater than α = 0.05, so we do not reject the null hypothesis; that is, we do not have enough evidence to support the claim. As explained earlier, since we failed to reject the null hypothesis, we can use Excel to compute the power of the test and the type II error for the one-sample z test. The following two steps enable you to calculate them.
- Type the following formula in a cell of the spreadsheet to calculate the critical value: =NORM.INV(probability, mean, standard_dev), where standard_dev is the standard error σ/√n → Enter
Therefore, in this example, C.V. = NORM.INV(0.05, 140, 20/SQRT(35)) = 134.44 (the probability is 0.05 because the test is left-tailed). Then type in a cell of the spreadsheet: =NORM.DIST(C.V., sample mean, SE, TRUE) → Enter
Therefore, in this example, β = 1 - NORM.DIST(134.44, 138, 20/SQRT(35), TRUE) = 0.85 (the value returned by NORM.DIST is subtracted from 1 because we need the area to the right of the critical value under the normal curve with mean 138, since the test is left-tailed). Finally, the power of the test is 1 - β = 0.146, which is very low, so the type II error is high.
- Find the effect size with the following formula or with the calculator at http://www.socscistatistics.com/effectsize/Default2.aspx:
Cohen’s d = |x̄ - μ0| / σ = |138 - 140| / 20 = 0.1. Alternatively, use Excel with the Real Statistics Resource Pack (opened with Ctrl-m in a spreadsheet):
Click Misc → select Statistical Power and Sample Size → OK → check One-sample normal and Power → OK → fill in the input boxes (in this case Effect Size = 0.1, Sample Size = 35, Tails = one, Alpha = 0.05) → OK
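If you prefer a scripted alternative to the Excel formulas above, the same numbers can be reproduced in base R with pnorm and qnorm. The following is a minimal sketch for Example 4; the variable names are ours, and the values (n = 35, sample mean 138, σ = 20, μ0 = 140, α = 0.05, left-tailed) are taken from the example.
n <- 35; xbar <- 138; sigma <- 20; mu0 <- 140; alpha <- 0.05
se <- sigma / sqrt(n)                        # standard error of the mean
z <- (xbar - mu0) / se                       # test statistic, about -0.59
pnorm(z)                                     # left-tailed p-value, about 0.28
cv <- qnorm(alpha, mean = mu0, sd = se)      # critical value, about 134.44
beta <- 1 - pnorm(cv, mean = xbar, sd = se)  # type II error, about 0.85
1 - beta                                     # power of the test, about 0.15
abs(xbar - mu0) / sigma                      # Cohen's d, 0.1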
T-Test for a Mean (one sample T-Test)
When the variance of the population is unknown, the t test is normally used for testing hypotheses about the mean. In this case the distribution of the variable should be approximately normal. The test statistic for the t test is
t = (x̄ - μ0) / (s/√n)
with degrees of freedom d.f. = n - 1.
The assumptions for the one-sample t test are the same as those mentioned for the z test.
The following is the SPSS procedure for this test:
SPSS
Analyze→ Compare Means→ One-Sample T test→ Option (use C=1-α)
Move the variables you want to test and fill the Test Value box with the hypothesized population mean.
Example 5: Suppose that, according to a previous study, Canadians take an average of 250 painkillers every year. A researcher wishes to know whether the study is still valid, so he randomly selects 25 people and asks each how many painkillers they take per year. The following data are fictitious and were generated with SPSS.
267 211 217 217 435 378 189 86 225 215 313 331 10 240 210 337 276 323 216 132 310 194 69 224 261
At α = 0.05, is the study still valid?
We have one variable, the number of painkillers, which records how many painkillers each person takes per year.
We state the test:
H0: μ=250 v.s. H1: μ≠250 (claim)
The data were selected at random and the sample is less than 10% of the population, so the first and second assumptions are met, and the variable is quantitative. However, since the sample size is less than 30, we need to run a test of normality. The following table indicates that the normality assumption is met, since the p-values for both tests are greater than 0.05.
Tests of Normality | |||||||
Kolmogorov-Smirnova | Shapiro-Wilk | ||||||
Statistic | df | Sig. | Statistic | df | Sig. | ||
The numbers of painkillers | .155 | 25 | .122 | .964 | 25 | .500 | |
a. Lilliefors Significance Correction |
Therefore, we run the one sample t test with the Test Value of 250.
SPSS Output:
One-Sample Test | ||||||
Test Value = 250 | ||||||
t | df | Sig. (2-tailed) | Mean Difference | 95% Confidence Interval of the Difference | ||
Lower | Upper | |||||
The numbers of painkillers | -.759 | 24 | .455 | -14.56000 | -54.1293 | 25.0093 |
The t value is t = -0.759 and the p-value is 0.455 (if this were a one-tailed test we would divide the p-value by 2), which is not significant. Since the p-value is greater than α = 0.05, we do not reject the null hypothesis: there is not enough evidence to support the claim that the average has changed. Because the null hypothesis is not rejected, computing the effect size is not meaningful; however, it is worth knowing the statistical power of the test and the type II error. Excel with the Real Statistics Resource Pack (http://www.real-statistics.com) enables us to calculate the power of the test, the effect size, the type II error, and much more.
The following is the procedure for Excel:
First, we enter the data into the spreadsheet in one column.
Ctrl-m → on the Real Statistics panel click on Misc → T Test and Non-parametric Equivalents → OK → specify the data in Input Range 1, set Alpha and Hyp Mean (250), check One sample and T test → OK
The output generated is the same as above, with the addition of Cohen’s d. The following is the output generated by Excel.
T Test: One Sample | ||||||
SUMMARY | Alpha | 0.05 | ||||
Count | Mean | Std Dev | Std Err | t | df | Cohen d |
25 | 235.44 | 95.86061061 | 19.17212212 | -0.759436014 | 24 | 0.151887203 |
T TEST | Hyp Mean | 250 | ||||
p-value | t-crit | lower | upper | sig | ||
One Tail | 0.227493493 | 1.71088208 | no | |||
Two Tail | 0.454986986 | 2.063898562 | 195.8706847 | 275.0093153 | no |
We need the effect size to compute the statistical power of test.
Ctrl-m → on the Real Statistics panel click on Misc → Statistical Power and Sample size → Ok→ check One-sample/paired t test and power → fill the boxes Effect Size (0.151), Sample Size (25), Tails (2), Alpha (0.05), Sum Count (it is 120 by default) → Ok → the statistical power is calculated as the following:
Power = 0.1128, which is low; if you wish to have greater power, you should increase the sample size. If you repeat the procedure above and this time check Sample Size instead of Power, Excel recommends the sample size that gives you the desired power. In this example, for a statistical power of 0.8, we need a sample of 343.
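If R is available, the whole analysis for Example 5 can also be run with the built-in t.test and power.t.test functions. The sketch below assumes the 25 values from the table above are typed in as a vector called pain (our name); the power and required sample size should agree closely with the Real Statistics output, apart from rounding.
pain <- c(267, 211, 217, 217, 435, 378, 189, 86, 225, 215, 313, 331, 10,
          240, 210, 337, 276, 323, 216, 132, 310, 194, 69, 224, 261)
t.test(pain, mu = 250)                    # one-sample t test: t = -0.76, p = 0.455
power.t.test(n = 25, delta = abs(mean(pain) - 250), sd = sd(pain),
             sig.level = 0.05, type = "one.sample")   # post-hoc power, about 0.11
power.t.test(power = 0.80, delta = abs(mean(pain) - 250), sd = sd(pain),
             sig.level = 0.05, type = "one.sample")   # sample size needed for power 0.80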
Z-Test for a Proportion
In this case, we test a percentage of a population, or a population proportion (P), that has a specific characteristic. For example, a researcher may wish to know the proportion of males in the population under study, or the proportion of patients who received a stent among all people who had a heart attack. Therefore, the hypotheses are:
H0:P=P0 v.s H1: P≠P0 or
H0:P=P0 v.s H1: P>P0 or
H0:P=P0 v.s H1: P<P0
Since there are two outcomes (success or failure) for each individual and the probability of success does not change from trial to trial, this hypothesis test is based on the binomial distribution, and we can use the normal distribution to approximate the binomial distribution when np ≥ 5 and nq ≥ 5.
Formula for the Z-Test for Proportion
z = (p - P0) / √(P0·q0/n)
where
sample proportion: p = x/n, where x is the number of individuals in the sample who have the specific characteristic
Population proportion: P0
Sample size: n
q0=1-P0
Assumption for Testing a Proportion
- The sample is a random sample
- The condition for a binomial experiment are satisfied
- np≥5 and nq≥5.
Since we can’t use SPSS, R is used and the following is the formula in R:
>prop.test(x=…, n=…., p=….)
For a one-tailed test:
>prop.test(x=…, n=…, p=…, alternative="greater") or alternative="less", depending on the direction of H1
If you want a different confidence level, add, for example, conf.level=0.99.
Example 6: Suppose a study claims that 25% of the population smokes at least once a month. A researcher wishes to check the accuracy of this claim, so he randomly selects 80 people and finds that 13 people in his sample smoke at least once a month. At α = 0.01, the researcher tests the claim.
The hypotheses:
H0:P=0.25 v.s H1: P≠0.25
So P0 = 0.25, n = 80, x = 13, and α = 0.01 → p = 13/80 = 0.1625
The following procedure in R is used:
>prop.test(x=13, n=80, p=0.25, conf.level=0.99)
Output:
1-sample proportions test with continuity correction
data: 13 out of 80, null probability 0.25
X-squared = 2.8167, df = 1, p-value = 0.09329
alternative hypothesis: true p is not equal to 0.25
99 percent confidence interval:
0.07870269 0.30082548
sample estimates: p= 0.1625
Since the p-value=0.09329>0.01 we do not have enough evidence to reject the null hypothesis.
χ2 Test for A Variance or Standard Deviation
As mentioned earlier, the chi-square distribution is used to calculate a confidence interval for the population variance or standard deviation. A chi-square (χ²) test with the following test statistic is used to test a claim about the variance of a population.
χ² = (n - 1)s² / σ0²
with degrees of freedom d.f. = n - 1, where n is the sample size, s² is the sample variance, and σ0² is the population variance if the null hypothesis is true. The following are the two-tailed, right-tailed, and left-tailed hypotheses, respectively:
H0: σ² = σ0² vs. H1: σ² ≠ σ0², H0: σ² = σ0² vs. H1: σ² > σ0²,
H0: σ² = σ0² vs. H1: σ² < σ0²
Assumption for the chi-square Test for a single variance
- The sample should be independently selected from the population at random
- The population should be normally distributed
Since SPSS cannot be used for this hypothesis test, we use the Free Statistics Calculators p-value calculator for the chi-square distribution, which returns the right-tailed probability. If the hypothesis is left-tailed, we subtract that value from 1, and if the hypothesis is two-tailed, the value is doubled.
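Alternatively, the chi-square statistic and its p-value can be computed directly in R with pchisq. The sketch below uses made-up numbers: n, s2, and sigma0_sq are placeholders for your own sample size, sample variance, and the population variance stated in the null hypothesis.
n <- 30; s2 <- 25; sigma0_sq <- 16                 # hypothetical values, replace with your own
chi_sq <- (n - 1) * s2 / sigma0_sq                 # test statistic
df <- n - 1
pchisq(chi_sq, df, lower.tail = FALSE)             # right-tailed p-value
pchisq(chi_sq, df)                                 # left-tailed p-value
2 * min(pchisq(chi_sq, df), pchisq(chi_sq, df, lower.tail = FALSE))   # two-tailed p-value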
Independent (unpaired) t-Test for two samples (Difference of two means)
Suppose we have two groups, or two populations, that do not share any individuals but both have one continuous variable that we are interested in comparing. For example, a medical researcher wishes to compare the average life expectancy at birth between males and females. In statistical terms, there are two independent groups, males and females, and the continuous variable of interest is age. In this analysis, we have one dependent variable, age, and one independent variable, sex, which has two categories or levels, male and female. The possible hypotheses are:
H0: μ1 = μ2 vs. H1: μ1 ≠ μ2 (two-tailed), or
H0: μ1 = μ2 vs. H1: μ1 < μ2 (left-tailed), or
H0: μ1 = μ2 vs. H1: μ1 > μ2 (right-tailed)
Assumptions for the independent T-Test to Determine the Difference Between Two Means
- The samples from each group should be randomly selected
- There should be no relationship between the groups (independence)
- The dependent variable should be quantitative and independent variable should be categorical with two levels
- If the sample sizes are less than 30, the populations must be normally or approximately normally distributed
- No significant or extreme outliers
- Homogeneity of variances
Sometimes the population variances (σ1², σ2²) are known; in this case the test statistic is z and the z test is applied. The following formula is used to calculate the z value:
z = ((x̄1 - x̄2) - (μ1 - μ2)) / √(σ1²/n1 + σ2²/n2)
where x̄1 and x̄2 are the sample means, μ1 and μ2 are the population means, and σ1 and σ2 are the population standard deviations.
Formula for the Confidence Interval for Difference Between Two Means:
(x̄1 - x̄2) - z_{α/2}·√(σ1²/n1 + σ2²/n2) < μ1 - μ2 < (x̄1 - x̄2) + z_{α/2}·√(σ1²/n1 + σ2²/n2)
Since σ1 and σ2 are known, we cannot use SPSS in this case. However, we can use the Free Statistics Calculators “One-Tailed or Two-Tailed Area Under the Standard Normal Distribution” calculator: compute the z test value manually, enter it in the box, and click Calculate to obtain the p-value.
Or use the Excel:
First enter two independent sample data into two columns then go to:
Data Analysis → z-Test: Two Sample for Means (1. select the ranges for Variable 1 and Variable 2 by clicking the corresponding boxes and dragging over the corresponding data; 2. type the hypothesized mean difference between the two groups (for example, if the null hypothesis is μ1 = μ2, type “0”); 3. type the variance of each group; 4. choose the desired alpha; 5. select Output Range, click the corresponding box, and drag over the cells where you want the output) → OK
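Because the population variances are treated as known, the z statistic and p-value can also be computed by hand in R with pnorm and qnorm. The following sketch uses placeholder values; the sample means, sample sizes, and variances are illustrative only and are not taken from the text.
xbar1 <- 132; xbar2 <- 128; n1 <- 40; n2 <- 45     # hypothetical sample means and sizes
var1 <- 25; var2 <- 36                             # hypothetical known population variances
se <- sqrt(var1/n1 + var2/n2)
z <- (xbar1 - xbar2) / se                          # test statistic for H0: mu1 - mu2 = 0
2 * pnorm(-abs(z))                                 # two-tailed p-value
(xbar1 - xbar2) + c(-1, 1) * qnorm(0.975) * se     # 95% confidence interval for mu1 - mu2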
Testing the Difference Between Two Means of Independent Samples Using the T-Test
In many cases the population standard deviations are unknown, so we use the t test to test the difference between means when the two samples are independent. The populations should still be normally or approximately normally distributed; otherwise, the sample size of each group should be at least 30.
The following formula is used when the variances are not equal:
t = ((x̄1 - x̄2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2)
Where the degree of freedom is the minimum value of (n1-1) and (n2-1).
When the variances are equal:
t = ((x̄1 - x̄2) - (μ1 - μ2)) / √( [((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2)] · (1/n1 + 1/n2) )
with degrees of freedom d.f. = n1 + n2 - 2.
Confidence Interval for the Difference of Two Means (Independent Samples):
(x̄1 - x̄2) - t_{α/2}·√(s1²/n1 + s2²/n2) < μ1 - μ2 < (x̄1 - x̄2) + t_{α/2}·√(s1²/n1 + s2²/n2)
The degree of freedom is the minimum of d.f.1 and d.f.2.
SPSS:
Analyze → Compare Means → Independent-Samples T Test → (move your test variable into the Test Variable(s) box and your grouping variable into the Grouping Variable box, then define your groups; for instance, here they are 1 and 2)
→Option (choose your confidence level) →Continue
Example 7: A study claims that the average serum cholesterol levels of males and females in their forties are equal. A researcher wishes to investigate the claim, so she randomly selects 25 females and 35 males. The following fictitious data were obtained from these two independent samples. At α = 0.05, test the difference between the two means.
Hypotheses:
H0: μ1 = μ2 (the claim) vs. H1: μ1 ≠ μ2
Serum cholesterol level for the sample of 25 normal females:
176 197 212 200 198 232 226 108 150 159 205 138 174 183 187 204 168 240 210 185 183 215 241 200 216
Serum cholesterol level for the sample of 35 normal males:
215 198 200 200 267 250 191 159 202 199 230 235 135 207 197 237 218 233 199 173 229 192 153 202 213 272 179 203 226 173 213 215 211 248 204
As we explained, the analysis for the independent t test involves one dependent variable and one independent variable with two levels. First, we define the variables in the Variable View window: in this example, the independent variable (sex) and the dependent variable (serum cholesterol level) are assigned to two rows. Second, in the Sex row we click on the Values cell to open the Value Labels window and choose a value of 1 for female and 2 for male: in the Value box type “1” and in the Label box “female”, click Add, do likewise for male, and click OK. In this way we identify the levels of our independent (grouping) variable with numbers. The second variable, serum cholesterol level, is our dependent variable; since spaces are not allowed in the Name column, we type “SerumLevel” there and “Serum Cholesterol Level” in the Label column.
Next, we go to the Data View window and enter the data. In the Sex column we type 1 for females and 2 for males, so in this example there are twenty-five 1s and thirty-five 2s, and we then enter the corresponding data for each group. Two columns are therefore created in Data View, Sex and SerumLevel; the first 25 rows contain the serum cholesterol levels for the females and the last 35 rows those for the males. Before running the test, because the sample size for the females is less than 30, we run the normality test for the female data (although SPSS runs the test for both groups). The following procedure is used to test the normality of the data in each group.
Analyze → Descriptive Statistics → Explore (move the dependent variable (Serum Cholesterol Level) into the Dependent list box as well as independent variable (Sex) into the Factor list box → Plots (check Histogram and Normality plots with tests) → Continue → Ok
Output:
Tests of Normality | |||||||
Sex | Kolmogorov-Smirnova | Shapiro-Wilk | |||||
Statistic | df | Sig. | Statistic | df | Sig. | ||
Serum Cholesterol Level | Female | .087 | 25 | .200* | .976 | 25 | .797 |
Male | .127 | 35 | .163 | .977 | 35 | .653 | |
*. This is a lower bound of the true significance. | |||||||
a. Lilliefors Significance Correction |
As the table above indicates, the data for both females and males are normally distributed, since both p-values are greater than 0.05 and therefore not significant. In addition to the table above, as explained in the section on normality testing, we can consider the normal Q-Q plot and the values of skewness and kurtosis.
Use the following procedures to run the test.
SPSS:
Analyze → Compare Means → Independent-Samples T Test → (move Serum Cholesterol Level into the Test Variable(s) box and Sex into the Grouping Variable box, then define your groups; here they are 1 and 2)
→ Options (choose your confidence level; in this example it is 1 - 0.05 = 0.95, or 95%) → Continue
Outputs:
Group Statistics | |||||
Sex | N | Mean | Std. Deviation | Std. Error Mean | |
Serum Cholesterol Level | Female | 25 | 200.2800 | 24.74456 | 4.94891 |
Male | 35 | 207.9430 | 29.63418 | 5.00909 |
Independent Samples Test | |||||||||||
Levene’s Test for Equality of Variances | t-test for Equality of Means | ||||||||||
F | Sig. | t | df | Sig. (2-tailed) | Mean Difference | Std. Error Difference | 95% Confidence Interval of the Difference | ||||
Lower | Upper | ||||||||||
Serum Cholesterol Level | Equal variances assumed | .293 | .591 | -1.056 | 58 | .295 | -7.66299 | 7.25768 | -22.19081 | 6.86483 | |
Equal variances not assumed | -1.088 | 56.503 | .281 | -7.66299 | 7.04150 | -21.76604 | 6.44006 |
As the second table above indicates, Levene’s test is also provided to test the equality of the variances of the two groups (females and males). Since the p-value for this hypothesis is .591 > 0.05, we do not reject the null hypothesis that the variances are equal, so we assume equal variances. As a rule of thumb, if the value of Sig. (in this example 0.591) is greater than α, equal variances are assumed.
In the next columns, we have the result of the t test for equality of means. The p-value is not significant, since it is greater than the significance level (0.295 > 0.05), which means we do not have enough evidence to reject the null hypothesis (the claim). Since we did not reject the null hypothesis, calculating the effect size is not very meaningful; however, with the following procedure we can compute the power of the test and β using SPSS.
Analyze → General Linear Model → Univariate (move the dependent and factor or group variables (independent variable) to the right boxes) → Option (move Overall and group variable (Sex) to the right box and check Descriptive Statistics, Estimate of effect size, and Observed power) → Continue → Ok
Output:
Tests of Between-Subjects Effects | |||||||||
Dependent Variable: Serum Cholesterol Level | |||||||||
Source | Type III Sum of Squares | df | Mean Square | F | Sig. | Partial Eta Squared | Noncent. Parameter | Observed Powerb | |
Corrected Model | 856.355a | 1 | 856.355 | 1.115 | .295 | .019 | 1.115 | .180 | |
Intercept | 2430254.343 | 1 | 2430254.343 | 3163.731 | .000 | .982 | 3163.731 | 1.000 | |
Sex | 856.355 | 1 | 856.355 | 1.115 | .295 | .019 | 1.115 | .180 | |
Error | 44553.326 | 58 | 768.161 | ||||||
Total | 2560765.379 | 60 | |||||||
Corrected Total | 45409.681 | 59 | |||||||
a. R Squared = .019 (Adjusted R Squared = .002) | |||||||||
b. Computed using alpha = .05 |
As the table above indicates, the observed power is 0.18, or 18%, and therefore β = 1 - 0.18 = 0.82.
The power of the test is very low, so the probability of committing a type II error is high. We can increase the sample size in order to increase the power of the test.
Excel:
In Excel, as opposed to SPSS, we enter the data in two separate columns of a spreadsheet, one for each sample (in this example, Female and Male).
Ctrl-m → on the Real Statistics panel click on Misc → T Test and Non-parametric Equivalents → OK → fill the Input Range 1 and Input Range 2 boxes with the corresponding data, set the Alpha value, check Two independent samples, and click OK
The Excel output (not reproduced here) gives the same result as SPSS. In addition, it provides the value of the effect size, which we need in order to compute the power of the test in Excel with the following procedure.
Ctrl-m → on the Real Statistics panel click on Misc → Statistical Power and Sample Size → ok → check Two sample- t test → ok → fill the boxes → ok
The power calculated with the above procedure is 0.179, which matches the SPSS output.
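The same analysis can be carried out in R with t.test and var.test. The sketch below assumes the female and male cholesterol values are entered exactly as listed above (the vector names are ours); the results should be close to the SPSS output, and any noticeable discrepancy would suggest that the printed data differ slightly from those actually analysed.
female <- c(176, 197, 212, 200, 198, 232, 226, 108, 150, 159, 205, 138, 174,
            183, 187, 204, 168, 240, 210, 185, 183, 215, 241, 200, 216)
male <- c(215, 198, 200, 200, 267, 250, 191, 159, 202, 199, 230, 235, 135,
          207, 197, 237, 218, 233, 199, 173, 229, 192, 153, 202, 213, 272,
          179, 203, 226, 173, 213, 215, 211, 248, 204)
var.test(female, male)                    # F test for equality of the two variances
t.test(female, male, var.equal = TRUE)    # pooled two-sample t test (equal variances assumed)
t.test(female, male)                      # Welch t test (equal variances not assumed)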
Testing the Difference Between Two Means (Dependent Samples or Paired Samples):
This test is used to compare the means of two related, or paired, groups. The two groups are matched because they share a subject or are related in some way. Most often, the characteristic of each observation is measured twice, because we are interested in comparing it under two different conditions. For example, a researcher wishes to test a new drug for decreasing blood pressure. To run this study, the researcher selects a sample and measures the participants’ blood pressure; the data obtained are called the pretest or baseline data. After the participants take the new drug, the variable (blood pressure) is measured again; the data obtained this time are called the post-test data. The researcher is interested in comparing the two sample means obtained from the pretest and post-test to see whether there is any difference, in other words, to test efficacy. These two samples are linked, or related, through the same person.
There are other cases in which samples are considered paired or dependent. Suppose we are interested in comparing the average desired future salary of girls and boys. If we select 30 girls and 30 boys and ask each how much they wish to earn in the future, the two samples are independent, as explained previously. However, if each boy is paired with a girl, the samples are no longer independent; they are related and are called paired samples. The procedure for this hypothesis test differs from that for independent samples: there is one independent variable with two related groups (before and after) and one dependent variable that is measured in both groups. To analyse paired samples, we often analyse the absolute changes (the differences in the dependent variable within each pair, or simply the differences between pretest and post-test), denoted by D = X1 - X2, where X1 and X2 are the data values from the pretest and post-test respectively. We then apply the one-sample t test explained earlier to determine whether the mean of the differences is equal to zero, which is equivalent to asking whether the means of the two sets of data (pretest and post-test) are equal.
The hypotheses are:
Two-tailed: H0: μD = 0 vs. H1: μD ≠ 0
Left-tailed: H0: μD = 0 vs. H1: μD < 0
Right-tailed: H0: μD = 0 vs. H1: μD > 0
where μD is the population mean of the differences. We calculate D̄, the sample mean of the differences, and sD, the sample standard deviation of the differences, to find the test statistic, which is:
t = (D̄ - μD) / (sD/√n) with d.f. = n - 1
Assumption for the T-Test for Two Means When the Samples are Dependent
- The samples are randomly selected and independent of one another.
- Data in each group are paired with the data in another group.
- The dependent variable should be quantitative variable.
- The variable D is normally or approximately normally distributed and should not have any significant outliers.
In some studies, the absolute change or difference (D = X1 - X2) is not the measure used for the change from the pretest or baseline data. Another common method is the percentage change, or relative change, denoted by P and calculated with the following formula:
P = (X1 - X2) / X1 × 100
Similarly, if the percentage-change method is used, the one-sample t test is applied to determine whether the mean of the percentage changes is equal to zero or not (the hypotheses above apply to P, with the test statistic t = (P̄ - μP) / (sP/√n) and d.f. = n - 1).
According to Vickers (2001), analysing the absolute change has acceptable statistical power when the correlation between the pretest and post-test data is high, although ANCOVA (analysis of covariance) has the highest statistical power. Based on simulation, he compared the statistical power of the different methods of analysis and concluded that the percentage change is statistically inefficient and has less power.
Example 8: A researcher wishes to investigate the efficacy of a new drug for hypertension. He randomly selects 25 volunteer patients with high blood pressure and measures their blood pressure before and after taking the new drug. At α = 0.05, do we have enough evidence to conclude that the new drug is effective?
- Absolute change method
The null hypothesis states that there is no difference between the two sets of measurements, that is, the new drug is not effective, whereas the alternative hypothesis states that the new drug is effective and lowers blood pressure, on average, compared with the pretest data. Since D = X1 - X2 (pretest minus post-test), an effective drug corresponds to a positive mean difference:
Right-tailed H0: μD = 0 vs. H1: μD > 0 (claim)
The following fictitious data are pretest and post-test data.
ID | Before (Pretest) X1 | After (Post-test) X2 | ID | Before (Pretest) X1 | After (Post-test) X2 | ID | Before (Pretest) X1 | After (Post-test) X2 | ID | Before (Pretest) X1 | After (Post-test) X2 |
1 | 148 | 149 | 9 | 149 | 143 | 17 | 148 | 142 | 25 | 154 | 143 |
2 | 149 | 152 | 10 | 156 | 149 | 18 | 152 | 150 | |||
3 | 151 | 143 | 11 | 147 | 142 | 19 | 155 | 144 | |||
4 | 159 | 152 | 12 | 148 | 144 | 20 | 158 | 157 | |||
5 | 154 | 144 | 13 | 155 | 144 | 21 | 143 | 141 | |||
6 | 144 | 141 | 14 | 158 | 150 | 22 | 160 | 162 | |||
7 | 148 | 148 | 15 | 147 | 142 | 23 | 145 | 141 | |||
8 | 147 | 142 | 16 | 147 | 142 | 24 | 159 | 153 |
Assumptions 1-3 are met, since the sample was selected randomly, there is no relationship between the observations, and the dependent variable is continuous. For the last assumption, however, we run the normality test. First, we define three variables in the Variable View window: ID, Pretest, and Posttest. Then we enter the data in the Data View window. To run the normality test we need the variable D (the difference between the paired data); the following procedure generates the variable D in SPSS.
Transform → Compute Variable → define D as the target variable and, in the Numeric Expression box, type “Pretest - Posttest” → OK
The variable D, generated by the procedure above, is considered to be tested for normality and the following is the output:
Tests of Normality | ||||||
Kolmogorov-Smirnova | Shapiro-Wilk | |||||
Statistic | df | Sig. | Statistic | df | Sig. | |
D | .116 | 25 | .200* | .964 | 25 | .502 |
*. This is a lower bound of the true significance. | ||||||
a. Lilliefors Significance Correction |
As the table and boxplot above indicate, the last assumption is also met, since the p-value is greater than 0.05 and there are no outliers in the boxplot.
The following procedure is used to run the t test for paired samples.
Analyze→ Compare means→ Paired-samples t Test (move two variables Before (Pretest) and After (Posttest) to the right) → Option (select the confidence level)
Output:
Paired Samples Test | |||||||||
Paired Differences | t | df | Sig. (2-tailed) | ||||||
Mean | Std. Deviation | Std. Error Mean | 95% Confidence Interval of the Difference | ||||||
Lower | Upper | ||||||||
Pair 1 | Pretest – Posttest | 4.84000 | 3.95474 | .79095 | 3.20756 | 6.47244 | 6.119 | 24 | .000 |
The t test value is 6.119 and the p-value is 0.000/2 = 0.000 < α = 0.05 (the p-value is divided by two since the hypothesis is one-tailed). Thus, we reject the null hypothesis, which means we have enough evidence to support the claim that the new drug is effective, on average, in this sample. As you can see, SPSS also gives a confidence interval for the mean difference, although we can find it with the following formula:
D̄ - t_{α/2}·(sD/√n) < μD < D̄ + t_{α/2}·(sD/√n)
Since the null hypothesis is rejected, it is worth computing the effect size. The following formula (Cohen’s d) is used to compute the effect size for paired samples test.
Cohen’s d = D̄/sD = 4.84/3.95474 = 1.2238, which is considered a large effect. We can also calculate the statistical power of the test with Free Statistics Calculators → Statistical Power → Post-hoc Statistical Power Calculator for a Student t-Test
The observed power for the one-tailed hypothesis is 0.896, which is high.
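Readers working in R can reproduce this paired analysis with t.test(..., paired = TRUE). The sketch below enters the 25 before/after pairs from the table above (ordered by ID, with our own vector names) and also shows the paired Cohen's d and the percentage-change version of the test described earlier.
before <- c(148, 149, 151, 159, 154, 144, 148, 147, 149, 156, 147, 148, 155,
            158, 147, 147, 148, 152, 155, 158, 143, 160, 145, 159, 154)
after <- c(149, 152, 143, 152, 144, 141, 148, 142, 143, 149, 142, 144, 144,
           150, 142, 142, 142, 150, 144, 157, 141, 162, 141, 153, 143)
t.test(before, after, paired = TRUE, alternative = "greater")   # t = 6.12, p < 0.001
d <- before - after
mean(d) / sd(d)                                  # Cohen's d for paired data, about 1.22
pct <- (before - after) / before * 100           # percentage change P
t.test(pct, mu = 0, alternative = "greater")     # percentage-change version of the test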
- Percentage change method
As we explained, we analyse the percentage changes (P) to determine whether the mean percentage change is zero, applying the one-sample t test. Since P is positive when the blood pressure decreases, the claim again corresponds to a right-tailed test:
Right-tailed H0: μP = 0 vs. H1: μP > 0 (claim)
The following procedure generates the variable P.
Transform → Compute Variable → define P as the target variable and, in the Numeric Expression box, type “(Pretest - Posttest) / Pretest * 100” → OK
We run the normality test for the variable P with the following output:
Tests of Normality | ||||||
Kolmogorov-Smirnova | Shapiro-Wilk | |||||
Statistic | Df | Sig. | Statistic | df | Sig. | |
P | .135 | 25 | .200* | .962 | 25 | .456 |
*. This is a lower bound of the true significance. | ||||||
a. Lilliefors Significance Correction |
The normality assumption is met and there are no outliers, so we are ready to run the one-sample t test with a test value of 0. We have already described the procedure, so the following is the output:
One-Sample Statistics | ||||
N | Mean | Std. Deviation | Std. Error Mean | |
P | 25 | 3.1797 | 2.55727 | .51145 |
One-Sample Test | ||||||
Test Value = 0 | ||||||
t | Df | Sig. (2-tailed) | Mean Difference | 95% Confidence Interval of the Difference | ||
Lower | Upper | |||||
P | 6.217 | 24 | .000 | 3.17975 | 2.1242 | 4.2353 |
As the table above indicates, the p-value (0.000/2 = 0.000, since the test is one-tailed) is significant and we reject the null hypothesis. That means we have enough evidence to support the claim that the new drug is effective. The result agrees with the absolute-change procedure, so we calculate the statistical power of the test using the effect size formula mentioned before:
Cohen’s d = P̄/sP = 3.1797/2.55727 = 1.2433, and with Free Statistics Calculators the statistical power is 0.9. Thus, the two methods give the same result with the same power.
As we explained in the independent two samples T Test, the Real Statistics Resource Pack in Excel provides the T Test for paired samples, the value of effect size and the power of the test.
Testing the Difference Between Proportions
In this case, we are interested in comparing the proportions of two independent samples. For example, a researcher wants to know whether there is any difference between the proportion of women who smoke and the proportion of men who smoke in a population. As we explained for the z test for a proportion, we denote the two population proportions by P1 and P2, and likewise the two sample proportions by p1 = x1/n1 and p2 = x2/n2, where x1 and x2 are the numbers of individuals in the samples who have the specific characteristic.
We can have the following hypotheses.
H0: P1 = P2 vs. H1: P1 ≠ P2, or equivalently H0: P1 - P2 = 0 vs. H1: P1 - P2 ≠ 0 (two-tailed)
H0 : P1=P2 v.s H1: P1>P2 (right-tailed)
H0 : P1=P2 v.s H1: P1<P2 (left-tailed)
Formula for the z Test for Comparing Two Proportions
z = ((p1 - p2) - (P1 - P2)) / √( p̄·q̄·(1/n1 + 1/n2) )
where
p̄ = (x1 + x2) / (n1 + n2) and q̄ = 1 - p̄
Assumption for the z Test for two Proportions
- The samples must be selected randomly.
- There must be two independent samples.
- For both samples np≥5 and nq≥5.
We use R to run this test or http://www.socscistatistics.com/tests/ztest/Default2.aspx.
>prop.test(x=c(x1, x2), n=c(n1, n2)); if the test is right-tailed or left-tailed, add alternative="greater" or alternative="less"
Example 9: A researcher reported that 9 out of 25 women and 5 out of 20 men received a flu shot within the previous year. Suppose the samples were selected randomly and independently. At α = 0.05, test the claim that there is no difference between the proportions of men and women who received a flu shot within the previous year.
The hypothesis is:
H0: P1 = P2 (claim) vs. H1: P1 ≠ P2
In this example x1=9, x2=5, n1=25, n2=20, so we can run the test in R with the following procedure:
> prop.test(x=c(9,5), n=c(25,20))
The procedure above also provides the confidence interval for the difference between the two proportions, or we can use the following formula:
Confidence Interval for the Difference Between Two Proportions
(p1 - p2) - z_{α/2}·√(p1q1/n1 + p2q2/n2) < P1 - P2 < (p1 - p2) + z_{α/2}·√(p1q1/n1 + p2q2/n2)
Or we can use the online calculator by http://www.socscistatistics.com/ .
The online calculator above gives a z score of 0.792 and a p-value of 0.42952, which is not significant since it is greater than 0.05. Thus, we do not have enough evidence to reject the claim that there is no difference between the proportions of men and women who received a flu shot within the previous year.
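The z score quoted above is easy to verify by hand in R; note that prop.test in the earlier call applies a continuity-corrected chi-square test, so its p-value differs slightly from the uncorrected z test reported by the online calculator.
x1 <- 9; n1 <- 25; x2 <- 5; n2 <- 20
p1 <- x1/n1; p2 <- x2/n2
pbar <- (x1 + x2) / (n1 + n2); qbar <- 1 - pbar       # pooled proportion and its complement
z <- (p1 - p2) / sqrt(pbar * qbar * (1/n1 + 1/n2))    # z score, about 0.79
2 * pnorm(-abs(z))                                    # two-tailed p-value, about 0.43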
Testing the difference between two variances
We may be interested in comparing the variances of two independent groups, for which an F test can be used. Suppose we have two independent samples with variances s1² and s2², drawn from two normal populations with variances σ1² and σ2². The F statistic is F = (s1²/σ1²) / (s2²/σ2²), which, if the null hypothesis of equal variances is true, reduces to:
F = s1²/s2², with d.f.1 = n1 - 1 and d.f.2 = n2 - 1 degrees of freedom, where n1 and n2 are the sample sizes. The test can be two-tailed or one-tailed; the null hypothesis is:
H0: σ1² = σ2² versus one of the following alternative hypotheses:
H1: σ1² ≠ σ2², or H1: σ1² > σ2², or H1: σ1² < σ2².
SPSS uses Levene’s test to examine the equality of variances, called testing the homogeneity of variances (or homoscedasticity), for a variable measured in two or more groups.
SPSS:
Analyze→ Compare Means→ Independent-Sample T Test (move test variable and grouping variable to the right panel and define groups as we already explained for comparing two means of two independent groups).
Alternatively, we can use Excel, which applies the F test described above, with the following procedure:
In Excel, entering the data is different from SPSS: enter the two independent samples in two columns, then:
Data Analysis → F-Test Two-Sample for Variances (select the ranges for Variable 1, Variable 2, and the Output Range by clicking the corresponding boxes and dragging over the corresponding cells) → OK
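In R the same F test for two variances is available as var.test. The sketch below uses two small hypothetical samples; x and y are placeholders for your own data (for instance, the female and male vectors from Example 7).
x <- c(5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 6.3, 5.1)    # hypothetical sample 1
y <- c(4.1, 7.0, 3.2, 6.8, 5.9, 2.7, 7.4, 4.4)    # hypothetical sample 2
var.test(x, y)                                    # F = s1^2/s2^2 with n1-1 and n2-1 d.f.
var.test(x, y, alternative = "greater")           # one-tailed version of the same test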
One-Way Analyse of Variance (ANOVA)
ANOVA, or analysis of variance, is a statistical test used to compare the means of three or more independent groups simultaneously. The test determines whether there are any statistically significant differences between the group means, but it does not specify which groups differ from each other; after identifying a statistically significant difference, we apply a post hoc test to locate the differences. In other words, ANOVA is the extension of the t test for two independent sample means to more than two groups, although the procedure differs from the two-sample t test because we analyse variances rather than the means directly. In fact, ANOVA indicates whether the grouping has any effect on the dependent variable by partitioning and comparing variances. In the ANOVA method, we calculate three variations: the total variation, the within-groups variation, and the between-groups variation. The total variation, denoted by SST (the total sum of squares), reflects the variation of all individuals around the grand mean (the mean of all observations). The within-groups variation, denoted by SSW or SSE (the sum of squares within groups, or the sum of squared errors), is due to the deviations of each individual from its own group mean. The between-groups variation, denoted by SSB (the sum of squares between groups), is due to the deviations of each group mean from the grand mean. As the following equation indicates, the total variation is equal to the sum of the between-groups and within-groups variations: SST = SSB + SSW
ANOVA enables us to explore how much of the total variation is due to the variation between the groups versus the variation within the groups. To calculate a test statistic for ANOVA, however, we need the variances corresponding to the between- and within-groups variations, which we explain below.
The ratio of the between-groups variance, or mean square between groups (MSB), to the within-groups variance, or mean square within groups (MSW), is the test statistic for ANOVA and follows an F distribution. ANOVA compares these two variances with the F test: if they are roughly equal, there is no evidence of a difference between the group means, whereas an F value sufficiently greater than 1 indicates that the between-groups variance is larger than the within-groups variance and therefore that not all the means are equal.
The following hypotheses are used:
H0: μ1 = μ2 = μ3 = … = μk vs. H1: at least one mean is different from the others
where k is the number of groups and μi is the mean of the i-th group.
The following table indicates the summary table of ANOVA:
One-Way ANOVA
Source | Sum of Squares | d.f. (degrees of freedom) | Mean Square | F |
Between Groups | SSB | k - 1 | MSB | F = MSB/MSW |
Within Groups | SSW | N - k | MSW | |
Total | SST | N - 1 | | |
where each mean square, or variance, is the respective sum of squares divided by its degrees of freedom.
Suppose there are k independent groups whose means we wish to compare, and ni is the sample size of the i-th group, so that the total sample size is N = n1 + n2 + … + nk.
Degree of freedom for between groups is: d.f.B=k-1.
Degree of freedom for within groups is: d.f.W=N-k.
Degree of freedom for total is: d.f.T=N-1.
We always have d.f.T=d.f.B+d.f.W
MSB = SSB/(k - 1), MSW = SSW/(N - k), F = MSB/MSW
Thus, in one-way ANOVA we have one independent variable with three or more independent levels or treatments and one dependent variable that is quantitative.
Assumptions:
- The groups should be independent which means there is no relationship between the observations in each group or between the groups themselves.
- Dependent variable should be quantitative variable.
- The observations or the dependent variable should be approximately normally distributed in each group, otherwise the samples must be large enough (n>30).
- No significant outliers.
- The homogeneity of variances between groups (especially when the sample sizes are unequal).
The following shows the SPSS procedure for ANOVA, illustrated with an example.
As we already explained for testing the difference between two means of independent samples, we have two variables: an explanatory (group) variable and a response variable, the one whose characteristic we measure.
Example 10: A researcher wishes to run an experimental study of a new drug for hypertension. He randomly selects 75 volunteer patients, who are assigned at random to three groups of 25. The first group takes the new drug, the second group takes the old drug, and the third group takes a placebo (fake drug). Suppose the assignments are random and there is no relationship between the patients. The following table shows, for each patient, the difference between the blood pressure before and after taking the assigned drug over 3 weeks (the absolute change, D) as well as the percentage change, P. At α = 0.05, test whether there are any differences between the three group means, in separate analyses for the absolute change and the percentage change.
Group 1 (new drug) D | Group 1 P | Group 2 (old drug) D | Group 2 P | Group 3 (placebo) D | Group 3 P |
17 | 11.87 | 18 | 11.51 | 0 | 0 | |||
25 | 17.77 | 37 | 23.11 | -5 | -3.34 | |||
3 | 1.92 | 6 | 4.24 | -2 | -1.31 | |||
49 | 31.72 | 26 | 17.33 | -3 | -2 | |||
38 | 23.90 | 20 | 13.35 | -5 | -3.34 | |||
27 | 18.15 | 2 | 1.32 | 3 | 1.92 | |||
45 | 31.09 | 33 | 21.83 | -4 | -2.58 | |||
16 | 10.55 | 13 | 8.86 | 2 | 1.32 | |||
25 | 16.29 | 4 | 2.75 | 4 | 2.84 | |||
27 | 17.44 | 15 | 9.28 | -3 | -2.11 | |||
41 | 27.03 | 23 | 15.13 | -1 | -0.66 | |||
38 | 25.26 | 40 | 27.04 | -5 | -3.18 | |||
13 | 8.14 | 14 | 9.60 | 3 | 2.07 | |||
36 | 23.14 | 6 | 4 | -1 | -0.68 | |||
45 | 29.68 | 29 | 18.76 | -5 | -3.39 | |||
35 | 23.03 | 4 | 2.53 | -3 | -1.99 | |||
1 | 0.65 | 32 | 21.59 | -4 | -2.78 | |||
34 | 23.12 | 19 | 13.03 | 4 | 2.54 | |||
30 | 18.78 | 6 | 3.98 | -2 | -1.32 | |||
24 | 16.44 | 4 | 2.7 | -3 | -2.04 | |||
3 | 2.03 | 17 | 11.36 | -2 | -1.36 | |||
33 | 23.14 | 6 | 4.1 | -2 | -1.31 | |||
20 | 13.33 | 29 | 19.62 | -1 | -0.63 | |||
39 | 25.56 | 14 | 9.32 | -3 | -2 | |||
23 | 14.47 | 28 | 18.28 | -1 | -0.65 |
Absolute change as the dependent variable:
The hypotheses are:
H0: μ1 = μ2 = μ3 vs. H1: at least one mean is different from the others (claim)
As we explained for the t test for two independent sample means, in the one-way ANOVA there is likewise one independent (group or factor) variable, here with more than two levels, and one dependent (response) variable. In this example, the group variable is Drug, which has three levels; we code them as New Drug = 1, Old Drug = 2, and Placebo = 3, and enter the data as we did for the two-sample t test. Before assessing the assumptions, it is worthwhile to visualise the data in the three groups. The following figure plots the individuals in each group; the differences between the three means are clear, with the mean of the first group (new drug) the highest and the mean of the third group (placebo) the lowest. Thus, we already expect at least one difference that is not due to chance. The following command produces the plot.
Graphs → Chart Builder → OK → choose Scatter/Dot and double-click the first option → drag the group variable (Drug) onto the X-Axis box and the dependent variable (Absolute change) onto the Y-Axis box → OK
The assumptions should be met before running the ANOVA test. As we supposed, the first assumption is met and we know the dependent variable is a continuous variable, therefore we first check the normality assumption in each group.
Normality assumption:
Analyze→ Descriptive Statistics→ explore (move the dependent variable and independent variable into respective boxes) → Plots (check Normality plots with tests and Histogram and uncheck stem and leaf) → Continue → Ok
Tests of Normality | |||||||
Drug Type | Kolmogorov-Smirnova | Shapiro-Wilk | |||||
Statistic | df | Sig. | Statistic | df | Sig. | ||
Difference | New Drug | .100 | 25 | .200* | .956 | 25 | .342 |
Old Drug | .167 | 25 | .070 | .934 | 25 | .108 | |
Placebo | .181 | 25 | .034 | .894 | 25 | .013 | |
*. This is a lower bound of the true significance. | |||||||
a. Lilliefors Significance Correction |
The p-values for the first and second groups are not significant, so the normality assumption is met for these groups. For the third group the p-value is less than 0.05, but since it is greater than 0.01 we can still treat the data in this group as approximately normally distributed, because ANOVA is quite robust to violations of the normality assumption and the sample size is not too small (n3 = 25).
Testing the outliers:
Analyze→ Descriptive Statistics→ explore (move the dependent variable (in this example Difference change) into the Dependent List box and the group variable into the Factor list box) → Statistics (check Outliers and Percentiles) → Continue → Ok
The following graph represents each group’s boxplot.
As the third boxplot (for the placebo group) indicates, there are 4 mild outliers close to the top whisker. With the following procedure, we generate z-scores to compare against -3 and 3.
Analyze → Descriptive Statistics → Descriptives → move the dependent variable into the right box and check the Save standardized values as variables box → ok
The z-scores appear in the Data View window. The z-scores for the four flagged values do not exceed -3 or 3, which means they are not significant outliers.
Homogeneity of variances:
This assumption will be checked when we run the ANOVA.
Analyze→ Compare Means→ One-Way ANOVA (move the dependent variable and independent (group) variable into respective boxes) → Options (you can choose Descriptive and homogeneity of variance test) → Continue→ Ok
Output:
Test of Homogeneity of Variances | |||
Difference | |||
Levene Statistic | df1 | df2 | Sig. |
16.147 | 2 | 72 | .000 |
The p-value for the test of homogeneity of variances is significant, since it is less than 0.05 (0.000 < 0.05). That means the null hypothesis (H0: σ1² = σ2² = σ3²) is rejected and the assumption is violated. Therefore, we rerun the ANOVA and choose the Welch or Brown-Forsythe test to interpret the result.
Analyze→ Compare Means→ One-Way ANOVA (move the dependent variable and independent variable into respective boxes) → Options (check Welch or Brown-Forsythe test) → Continue→ Ok
Output:
Descriptives | ||||||||||||||
Difference | ||||||||||||||
N | Mean | Std. Deviation | Std. Error | 95% Confidence Interval for Mean | Minimum | Maximum | ||||||||
Lower Bound | Upper Bound | |||||||||||||
New Drug | 25 | 27.4800 | 13.34516 | 2.66903 | 21.9714 | 32.9886 | 1.00 | 49.00 | ||||||
Old Drug | 25 | 17.8000 | 11.52533 | 2.30507 | 13.0426 | 22.5574 | 2.00 | 40.00 | ||||||
Placebo | 25 | -1.5600 | 2.81484 | .56297 | -2.7219 | -.3981 | -5.00 | 4.00 | ||||||
Total | 75 | 14.5733 | 15.84739 | 1.82990 | 10.9272 | 18.2195 | -5.00 | 49.00 | ||||||
ANOVA | ||||||||||||||
Difference | ||||||||||||||
Sum of Squares | df | Mean Square | F | Sig. | ||||||||||
Between Groups | 10931.947 | 2 | 5465.973 | 51.428 | .000 | |||||||||
Within Groups | 7652.400 | 72 | 106.283 | |||||||||||
Total | 18584.347 | 74 |
Robust Tests of Equality of Means | ||||
Difference | ||||
Statistica | df1 | df2 | Sig. | |
Welch | 84.305 | 2 | 35.072 | .000 |
Brown-Forsythe | 51.428 | 2 | 49.367 | .000 |
a. Asymptotically F distributed. |
The first table above shows the descriptive statistics for each group. The second table is the ANOVA table, but since the homogeneity of variances assumption is violated, it cannot be interpreted; instead, we interpret the p-value obtained from the Welch or Brown-Forsythe test. As both tests indicate, the p-value is significant and the null hypothesis is rejected. That means at least one group mean is different from the others, and we have enough evidence to support the claim.
Partial eta squared is a measure of effect size for ANOVA and can be interpreted as the proportion of variability in the dependent variable explained by the independent variable. According to one guideline, a partial eta squared of 0.01 is small, 0.06 is medium, and 0.138 is large. We can calculate it in SPSS with the following procedure:
Analyze → General Linear Model → Univariate (move the dependent and factor variable to the right) → Option (check Estimate of effect size and Observed power) → Continue → Ok
Tests of Between-Subjects Effects | ||||||||
Dependent Variable: Difference | ||||||||
Source | Type III Sum of Squares | df | Mean Square | F | Sig. | Partial Eta Squared | Noncent. Parameter | Observed Powerb |
Corrected Model | 10931.947a | 2 | 5465.973 | 51.428 | .000 | .588 | 102.857 | 1.000 |
Intercept | 15928.653 | 1 | 15928.653 | 149.870 | .000 | .675 | 149.870 | 1.000 |
Drug | 10931.947 | 2 | 5465.973 | 51.428 | .000 | .588 | 102.857 | 1.000 |
Error | 7652.400 | 72 | 106.283 | |||||
Total | 34513.000 | 75 | ||||||
Corrected Total | 18584.347 | 74 | ||||||
a. R Squared = .588 (Adjusted R Squared = .577) | ||||||||
b. Computed using alpha = .05 |
The table above indicates that partial eta squared is 0.588 (about 59% of the variance in the absolute change is explained by the type of drug), which is large, and the power of the test is 1, which is also high. When the null hypothesis is rejected, we wish to determine which group means differ from the others. Thus, after running the ANOVA and rejecting the null hypothesis, we run a post hoc test such as the Tukey or Scheffé test to determine which means are different: the Tukey test is used when the groups have equal sample sizes and the Scheffé test when the sample sizes differ. However, those tests assume homogeneity of variances; since that assumption is violated in this example, we choose Tamhane's T2 test. In SPSS, after moving the variables, click on Post Hoc and choose the appropriate test:
Post Hoc Tests
Multiple Comparisons | ||||||
Dependent Variable: Absolute change | ||||||
Tamhane | ||||||
(I) Drug Type | (J) Drug Type | Mean Difference (I-J) | Std. Error | Sig. | 95% Confidence Interval | |
Lower Bound | Upper Bound | |||||
New Drug | Old Drug | 9.68000* | 3.52662 | .025 | .9488 | 18.4112 |
Placebo | 29.04000* | 2.72776 | .000 | 22.0825 | 35.9975 | |
Old Drug | New Drug | -9.68000* | 3.52662 | .025 | -18.4112 | -.9488 |
Placebo | 19.36000* | 2.37282 | .000 | 13.3189 | 25.4011 | |
Placebo | New Drug | -29.04000* | 2.72776 | .000 | -35.9975 | -22.0825 |
Old Drug | -19.36000* | 2.37282 | .000 | -25.4011 | -13.3189 | |
*. The mean difference is significant at the 0.05 level. |
The table above indicates that the differences between all three groups are significant, which means the new drug is effective, and the value of partial eta squared indicates that the effect is large.
As in Example 8 for the paired samples, we rerun the test, replacing the dependent variable “absolute change” with “percentage change”.
Percentage change as the dependent variable:
The following plot is generated to visualise the individual values before analysing them. This plot, like the plot for the absolute change, indicates that the mean of the placebo group is smaller than those of the other two groups, while the means of the new drug and old drug groups are close.
We skip assessing the assumptions, since the results are almost the same as for the absolute change. The following is the result of the ANOVA test with the percentage change as the dependent variable:
ANOVA | |||||
Percentage change | |||||
Sum of Squares | df | Mean Square | F | Sig. | |
Between Groups | 4789.889 | 2 | 2394.944 | 51.707 | .000 |
Within Groups | 3334.851 | 72 | 46.317 | ||
Total | 8124.740 | 74 |
Robust Tests of Equality of Means | ||||
Percentage change | ||||
Statistica | df1 | df2 | Sig. | |
Welch | 84.591 | 2 | 35.148 | .000 |
Brown-Forsythe | 51.707 | 2 | 49.342 | .000 |
a. Asymptotically F distributed. |
Since the p-value is less than 0.05 and significant, the result for the percentage change is the same as for the absolute change: there is at least one statistically significant difference among the new drug, old drug, and placebo groups. The following table also shows the partial eta squared and the power of the test for this ANOVA; as you can see, the values for the two ANOVA tests are essentially equal.
Tests of Between-Subjects Effects | ||||||||
Dependent Variable: Percentage change | ||||||||
Source | Type III Sum of Squares | df | Mean Square | F | Sig. | Partial Eta Squared | Noncent. Parameter | Observed Powerb |
Corrected Model | 4789.889a | 2 | 2394.944 | 51.707 | .000 | .590 | 103.415 | 1.000 |
Intercept | 6972.278 | 1 | 6972.278 | 150.533 | .000 | .676 | 150.533 | 1.000 |
Drug | 4789.889 | 2 | 2394.944 | 51.707 | .000 | .590 | 103.415 | 1.000 |
Error | 3334.851 | 72 | 46.317 | |||||
Total | 15097.018 | 75 | ||||||
Corrected Total | 8124.740 | 74 | ||||||
a. R Squared = .590 (Adjusted R Squared = .578) | ||||||||
b. Computed using alpha = .05 |
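For readers who prefer R, the one-way analysis of the absolute change, including the classical ANOVA, the Welch test used when the variances are unequal, and post hoc comparisons, can be reproduced with base R. The sketch below enters the absolute-change (D) values from the table in Example 10; the object names new_drug, old_drug, placebo, and bp are ours, and the F statistic should agree with the SPSS ANOVA table above.
new_drug <- c(17, 25, 3, 49, 38, 27, 45, 16, 25, 27, 41, 38, 13, 36, 45, 35,
              1, 34, 30, 24, 3, 33, 20, 39, 23)
old_drug <- c(18, 37, 6, 26, 20, 2, 33, 13, 4, 15, 23, 40, 14, 6, 29, 4, 32,
              19, 6, 4, 17, 6, 29, 14, 28)
placebo <- c(0, -5, -2, -3, -5, 3, -4, 2, 4, -3, -1, -5, 3, -1, -5, -3, -4,
             4, -2, -3, -2, -2, -1, -3, -1)
bp <- data.frame(change = c(new_drug, old_drug, placebo),
                 drug = factor(rep(c("New", "Old", "Placebo"), each = 25)))
fit <- aov(change ~ drug, data = bp)
summary(fit)                              # classical one-way ANOVA: F about 51.4
bartlett.test(change ~ drug, data = bp)   # one check of the homogeneity of variances
oneway.test(change ~ drug, data = bp)     # Welch ANOVA, robust to unequal variances
TukeyHSD(fit)                             # post hoc comparisons (assumes equal variances)
pairwise.t.test(bp$change, bp$drug, pool.sd = FALSE)   # pairwise t tests without a pooled SD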
Two-Way Analysis of Variance
Two-way ANOVA is applied when we are interested in evaluating the effects of two independent variables on one dependent variable simultaneously. The two independent variables are called factors or group variables. The two-way ANOVA provides the results of three hypothesis tests at once. Two of them determine whether there are any differences between the group means defined by each factor separately; in other words, they evaluate the effect of each factor on the dependent variable, called a main effect. More importantly, the two-way ANOVA also evaluates any interaction effect between the two factors on the dependent variable: because the two independent variables, each with two or more levels, are considered together, their combined effect on the dependent variable may differ from what would be seen if the two variables were applied separately, which means the two independent variables may interact with each other.
Suppose the first factor has a levels and the second factor has b levels; the design is then called an a × b design, and it generates a × b groups, called treatment groups, which represent all combinations of the two factors.
The following table represents a 2 × 3 design, which generates six treatment groups, since factor A has two levels and factor B has three levels.
| | Factor B, level b1 | Factor B, level b2 | Factor B, level b3 |
| Factor A, level a1 | Group 1 (a1, b1) | Group 2 (a1, b2) | Group 3 (a1, b3) |
| Factor A, level a2 | Group 4 (a2, b1) | Group 5 (a2, b2) | Group 6 (a2, b3) |
For example, in Group 1, the dependent variable of each individual is influenced by level a1 of factor A and level b1 of factor B.
Possible hypothesis tests in two-way ANOVA:
(main effects)
H0: There is no difference between the group means defined by the levels of the first (or second) factor vs. H1: There is at least one difference between the group means
(interaction effect)
H0: There is no interaction effect between the levels of the two independent variables on the dependent variable vs. H1: There is an interaction effect between the levels of the two factors
The following table represents the two-way ANOVA table which is similar to one-way ANOVA:
Source | Sum of Squares | d.f | Mean Square | F | |
Main Effect A | SSA | a-1 | MSA | FA | |
Main Effect B | SSB | b-1 | MSB | FB | |
Interaction Effect AB | SSA×B | (a-1)(b-1) | MSA×B | FA×B | |
Within | SSW | ab(n-1) | MSW | ||
Total | SST | N-1 |
SSA: sum of squares for factor A
SSB: sum of squares for factor B
SSA×B: sum of squares for the interaction effect
SSW: sum of squares within each treatment group (error)
Each mean square is calculated as the corresponding sum of squares divided by its degrees of freedom; likewise, each F is the corresponding mean square divided by MSW.
Interaction Effect Types:
- Ordinal interaction: occurs when the effect of one factor keeps the same direction across the levels of the other factor; when the means are graphed, the lines are not parallel but do not cross each other.
- Disordinal interaction: occurs when the ordering of the means changes across levels, so that graphically the lines cross each other.
- No interaction: occurs when there is no significant interaction effect and the lines are parallel.
Note: if there is no interaction effect, meaning the interaction is not statistically significant, we can interpret the main effects individually from the two-way ANOVA table (as in the following example). However, when there is a statistically significant interaction effect, interpreting the main effects individually can be misleading.
Example 11: Recall Example 10; this time the researcher takes the total cholesterol level into account. He wishes to know whether the effect of the drug type on blood pressure is influenced by the total cholesterol level. He randomly selects 270 volunteer patients, who are divided by total cholesterol level into “low, medium, and high”. He then randomly selects 30 patients at each cholesterol level to take the new drug and repeats this selection for the old drug and the placebo. Suppose the assignments are random and there is no relationship between the patients. There are therefore 9 treatment groups, each consisting of 30 patients. The dependent variable is the difference between each patient's blood pressure before and after taking the assigned drug over 3 weeks (the absolute change, D) or the percentage change, P; the independent variables (factors) are the total cholesterol level and the drug type. At α = 0.05, he applies the two-way ANOVA to examine the following hypothesis tests for the absolute and percentage change:
H0: There is no difference between the means of absolute /percent changes across the types of the drug (main effect)
v.s
H1: At least one mean is different from the other
or
H0: There is no difference between the means of absolute /percent changes across the levels of the total cholesterol (main effect)
v.s
H1: At least one mean is different from the other
Or
H0: There is no interaction effect between the total cholesterol levels and the type of drugs on the blood pressure
v.s
H1: There is an interaction effect between the total cholesterol levels and the type of drugs on the blood pressure
The assumptions for one-way and two-way ANOVA are the same. However, in two-way ANOVA, normality, outliers, and homogeneity of variances should be investigated for each combination of the levels of the two independent variables, and for the two-way ANOVA test the sample sizes of the groups must be equal, as opposed to one-way ANOVA.
The following procedure is used to split the data to test the normality and outlier assumptions.
Data → Split File → check Organize output by groups and move the factors or independent variables (in this example: Total cholesterol level and Drug type) into the right box → ok
A notification appears in the output window indicating that your data are grouped by the levels of each of the independent variables. As with the one-way ANOVA test, the following procedure is used to test the normality and outlier assumptions for each combination of the levels of the two independent variables.
Analyze → Descriptive Statistics → Explore → fill the Dependent and Factor list boxes with the corresponding variables → Statistics (check Descriptives and Outliers) → Continue → Plots (check Normality plots with tests) → Continue → ok
After running the above procedure, there are 9 outputs, one for each combination. Since the normality and outlier assumptions are met for each combination, we only demonstrate the following output for the combination of the low level of total cholesterol and the new drug. (By exchanging the absolute change with the percentage change as the dependent variable, the assumptions can be checked for the percentage change in the same way.)
Drug type = New Drug, Total Cholesterol level = Low
Tests of Normalitya | |||||||
Drug type | Kolmogorov-Smirnovb | Shapiro-Wilk | |||||
Statistic | df | Sig. | Statistic | df | Sig. | ||
Absolute change | New Drug | .101 | 30 | .200* | .956 | 30 | .249 |
*. This is a lower bound of the true significance. | |||||||
a. Drug type = New Drug, Total Cholesterol level = Low | |||||||
b. Lilliefors Significance Correction |
Before running the two-way ANOVA test, you should go back to Split File under Data and check Analyze all cases. The following procedure is used for the two-way ANOVA test as well as the test for homogeneity of variances assumption.
Analyze→ General Linear Model→ Univariate (move the independent variables (Drug and Cholesterol) into the "Fixed Factor(s)" box and the dependent variable (Absolute change / Percentage change) into the "Dependent Variable" box) → Plots → move Cholesterol into the "Horizontal Axis" box and Drug into the "Separate Lines" box (usually move the factor with more levels to Horizontal Axis and the factor with fewer levels into Separate Lines) → Add (repeat the procedure for each factor separately) → Continue → Options (move the factors (OVERALL, Drug, Cholesterol, and Drug*Cholesterol) into the "Display Means for" box and check Compare main effects, Descriptive statistics, Estimates of effect size, and Homogeneity tests) → Continue → Post Hoc (move the factor that has more than two levels and check an appropriate test) → Ok
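For readers who prefer a scripting workflow, the same two-way ANOVA can be sketched outside SPSS. The following minimal Python example uses pandas and statsmodels; the file name bp_study.csv and the column names bp_change, drug, and cholesterol are hypothetical and assume the SPSS data set has been exported to CSV, so this is an illustrative sketch rather than part of the SPSS procedure above.

```python
# Minimal sketch: two-way ANOVA in Python (hypothetical export of the Example 11 data).
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("bp_study.csv")   # hypothetical file with columns bp_change, drug, cholesterol

# Fit a model with both main effects and their interaction.
model = smf.ols("bp_change ~ C(drug) * C(cholesterol)", data=df).fit()

# Type II sums of squares; with a balanced design (equal group sizes) these
# should closely match the SPSS Tests of Between-Subjects Effects table.
print(sm.stats.anova_lm(model, typ=2))
```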
Output:
The following output is for the test of homogeneity of variances for each combination of the levels of the two independent variables. There is not a statistically significant difference between the variances, since the P-Value (0.088) is greater than 0.05. Therefore, this assumption is met across the combinations.
Levene’s Test of Equality of Error Variancesa | ||||||||||||
Dependent Variable: Absolute change | ||||||||||||
F | df1 | df2 | Sig. | |||||||||
1.749 | 8 | 261 | .088 | |||||||||
Tests the null hypothesis that the error variance of the dependent variable is equal across groups. | ||||||||||||
a. Design: Intercept + Drug + Cholesterol + Drug * Cholesterol | ||||||||||||
Tests of Between-Subjects Effects | ||||||||||||
Dependent Variable: Absolute change | ||||||||||||
Source | Type III Sum of Squares | df | Mean Square | F | Sig. | Partial Eta Squared | Noncent. Parameter | Observed Powerb | ||||
Corrected Model | 40731.376a | 8 | 5091.422 | 202.165 | .000 | .861 | 1617.321 | 1.000 | ||||
Intercept | 52035.597 | 1 | 52035.597 | 2066.178 | .000 | .888 | 2066.178 | 1.000 | ||||
Drug | 40662.577 | 2 | 20331.288 | 807.295 | .000 | .861 | 1614.589 | 1.000 | ||||
Cholesterol | 41.410 | 2 | 20.705 | .822 | .441 | .006 | 1.644 | .190 | ||||
Drug * Cholesterol | 27.389 | 4 | 6.847 | .272 | .896 | .004 | 1.088 | .110 | ||||
Error | 6573.147 | 261 | 25.184 | |||||||||
Total | 99340.120 | 270 | ||||||||||
Corrected Total | 47304.523 | 269 | ||||||||||
a. R Squared = .861 (Adjusted R Squared = .857) | ||||||||||||
b. Computed using alpha = .05 |
As the table above indicates, the interaction effect is not statistically significant. Thus, as mentioned before, we can interpret the main effects individually from the table above. The first hypothesis test, for the main effect of Drug, has a P-Value reported as .000, which means the null hypothesis is rejected. Therefore, there is a statistically significant difference between the means of the absolute changes across the drug types. Moreover, its partial eta squared indicates that about 86% of the variation in the absolute change is explained by the drug type. However, neither the effect of the total cholesterol levels nor the interaction effect is statistically significant. In terms of the main effects, this means the levels of the drug have different effects on the absolute change, whereas the levels of the total cholesterol have the same effect on the absolute change. Regarding the interaction effect, the effect of the drug type on blood pressure is not influenced by the total cholesterol level. The following table indicates the result of the Post Hoc test, for which the Tukey test is chosen. All the levels of the drug type are statistically different in terms of the means of the absolute change, since the P-Values are less than 0.05.
Multiple Comparisons | ||||||
Dependent Variable: Absolute change | ||||||
Tukey HSD | ||||||
(I) Drug type | (J) Drug type | Mean Difference (I-J) | Std. Error | Sig. | 95% Confidence Interval | |
Lower Bound | Upper Bound | |||||
New Drug | Old Drug | 7.8702* | .74810 | .000 | 6.1068 | 9.6336 |
Placebo | 29.0599* | .74810 | .000 | 27.2965 | 30.8233 | |
Old Drug | New Drug | -7.8702* | .74810 | .000 | -9.6336 | -6.1068 |
Placebo | 21.1897* | .74810 | .000 | 19.4263 | 22.9531 | |
Placebo | New Drug | -29.0599* | .74810 | .000 | -30.8233 | -27.2965 |
Old Drug | -21.1897* | .74810 | .000 | -22.9531 | -19.4263 | |
Based on observed means. The error term is Mean Square(Error) = 25.184. | ||||||
*. The mean difference is significant at the .05 level. |
Absolute change | ||||
Tukey HSDa,b | ||||
Drug type | N | Subset | ||
1 | 2 | 3 | ||
Placebo | 90 | -2.8673 | ||
Old Drug | 90 | 18.3224 | ||
New Drug | 90 | 26.1925 | ||
Sig. | 1.000 | 1.000 | 1.000 | |
Means for groups in homogeneous subsets are displayed. Based on observed means. The error term is Mean Square(Error) = 25.184. | ||||
a. Uses Harmonic Mean Sample Size = 90.000. | ||||
b. Alpha = .05. |
The figure above represents the plot of the means of each treatment group. The lines are approximately parallel, so there is no interaction effect, as we expected. The means for the new drug are higher than those for the old drug and the placebo across the total cholesterol levels. Within each line representing a level of the drug, the three means are at almost the same level, which means the effect of the total cholesterol level on the absolute change is not statistically significant.
Note: If the interaction effect were significant, SPSS would not provide a pairwise comparisons table for it by default, so we would run it through syntax.
Analyze→ General Linear model→ univariate → Paste
We would see "/EMMEANS=TABLES(Drug*Cholesterol)" in the syntax window; add "COMPARE(Drug)" to that line, then select all the lines and Run.
The results of this example for both the absolute change and the percentage change are similar.
Nonparametric Statistics
As previously mentioned, parametric tests are based on assumptions, the most important of which is normality of the population under study. If this assumption is not met, or if the test is not about a mean, variance, or proportion, then nonparametric tests are used, even though they are less sensitive, less informative, and less efficient. The only assumption is that the samples are selected randomly. Since the normality assumption is not required, these tests are sometimes called distribution-free tests. These tests can also be used when the data are nominal or ordinal, so the usual measure of center is the median, whereas for parametric tests it is the mean.
Chi-Square Test
Chi-Square Test for Frequency (Goodness of Fit or Pearson’s chi-square goodness of fit):
The chi-square test (goodness of fit) is a single-sample nonparametric test that is used to investigate the frequencies of qualitative (categorical) data across the levels of a categorical variable. The test requires one categorical variable with two or more mutually exclusive levels, and we wish to know whether the frequencies of these levels are equal or follow a hypothesized distribution. For example, suppose we are interested in comparing people's preferences among the four most common over-the-counter analgesics, which are aspirin, acetaminophen (Tylenol), ibuprofen (Advil), and naproxen (Aleve). There is one categorical variable, painkilling drug, with four levels, and the chi-square statistic can show whether people have the same preference for each type of painkiller mentioned above. As another example, an emergency service may wish to know whether it receives more urgent patients on a specific day of the week so it can schedule adequate staff according to the results.
To run the test first we state the hypotheses:
H0: There is no difference between the frequencies across the levels, or the frequencies are distributed according to a hypothesized pattern.
H1: There is at least one level that has a frequency different from the others, or the frequencies are not distributed according to the hypothesized pattern.
As the null hypothesis describes, we expect a value for the frequency of each level, called the Expected Frequency and denoted by E. Then a sample from the population is selected to obtain the data, or the actual frequency of each level, which is called the Observed Frequency and denoted by O.
Chi-Square Test for Frequency (Goodness of Fit) is used to determine whether the differences between the observed frequencies and expected frequencies are statistically significant.
Assumptions:
- A random sample should be used.
- The variable is categorical (nominal or ordinal).
- The expected frequency should be at least 5 for at least 80% of the levels (no more than 20% of the expected frequencies may be below 5).
- The observations or individuals should be independent.
The Chi-Square test statistic is:
χ² = Σ (O − E)² / E
with degrees of freedom equal to the number of levels minus 1.
SPSS:
To run the test with SPSS, we can enter the data in two ways. The first is used when our categorical variable is defined in the Variable View window with a label for each level.
Example 12: Suppose we are interested in testing people's preferences among the painkilling drugs mentioned above, and the data are gathered for a sample size of twenty. First, we label each level of the variable as Aspirin – "1", Acetaminophen (Tylenol) – "2", Ibuprofen (Advil) – "3", and Naproxen (Aleve) – "4". (α = 0.05)
The hypotheses are:
H0: There are no differences between the population preferences for the brands
H1: There are differences between the population preferences for the brands
Since there are 4 levels and we expect the levels to have equal preferences, the expected frequency for each level is 20/4 = 5. Suppose the following data are collected after the survey.
1 2 3 2 3 4 4 1 3 4 2 2 4 4 3 2 1 2 4 3
First, in the Variable View window define the categorical variable (Painkiller) and label the levels. Then, in the Data View window enter the data above. Finally, use the following procedure to run the chi-square test.
Analyze → Nonparametric Tests → Legacy Dialogs → chi-square (move the categorical variable into the Test variable list box and check the box for all categories equal) → Options (Descriptive) → ok
Output:
Descriptive Statistics | |||||
N | Mean | Std. Deviation | Minimum | Maximum | |
Painkilling Drug | 20 | 2.7000 | 1.08094 | 1.00 | 4.00 |
Chi-Square Test
Frequencies
Painkilling Drug | |||
Observed N | Expected N | Residual | |
Aspirin | 3 | 5.0 | -2.0 |
Tylenol | 6 | 5.0 | 1.0 |
Advil | 5 | 5.0 | .0 |
Aleve | 6 | 5.0 | 1.0 |
Total | 20 |
Test Statistics | |
Painkilling Drug | |
Chi-Square | 1.200a |
df | 3 |
Asymp. Sig. | .753 |
a. 0 cells (0.0%) have expected frequencies less than 5. The minimum expected cell frequency is 5.0. |
The observed frequency for each level is shown in the Frequencies table, and the last table shows the chi-square statistic and the P-Value, which is 0.753 > 0.05 and therefore not significant. That means the null hypothesis, which states that there is no difference between the population preferences, is not rejected.
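The same goodness-of-fit calculation can be checked outside SPSS. Below is a minimal sketch using SciPy with the observed counts from Example 12; scipy.stats.chisquare assumes equal expected frequencies by default.

```python
# Minimal sketch: chi-square goodness-of-fit test for Example 12.
from scipy.stats import chisquare

observed = [3, 6, 5, 6]               # Aspirin, Tylenol, Advil, Aleve
stat, p_value = chisquare(observed)   # equal expected frequencies (5 each) by default
print(stat, p_value)                  # approximately 1.2 and 0.753, as in the SPSS output
```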
The second way is used when the data have already been summarized in a table of counts, called a Contingency Table, or when the sample size is so large that entering every individual observation is impractical. In this case, we add another variable, "Frequency", in the Variable View window. Then we enter the observed frequencies in the Data View window next to the corresponding levels of the variable. Before running the chi-square test, you should inform SPSS that the values of the frequency variable are frequencies. With the following procedure, the data will be weighted.
Data→ Weight Cases→ Weight Cases by Frequency Variable
Example 13: Suppose, according to the last survey, the proportions of the painkillers “Aspirin, Tylenol, Advil, Aleve, and other” are 18%, 30%, 35%, 15%, and 2% respectively. A new survey is conducted to determine whether the results are still valid. The following data indicates the preferences of each type of painkiller respectively among 150 people: 22, 40, 45, 25, 18
At α=0.05 do the preferences fit in the last survey?
The categorical variable is “Painkiller” and we label these values as 1= “Aspirin”, 2= “Tylenol”, 3= ”Advil”, 4= ”Aleve”, and 5= ”Other”. Since the second way is used, we add the second variable which is “Frequency” and then enter the data in the Data View window. For the “Painkiller” variable we enter 1 to 5 and corresponding data for the “Frequency” variable. The hypotheses for this example are:
H0: The distribution of frequencies is still valid.
H1: The distribution of frequencies does not fit the result of last study.
To run the test with SPSS first:
Data→ Weight Cases→ Weight Cases by Frequency Variable (move Observed Frequency into Frequency Variable box)
The status bar at the bottom right then shows "Weight On", and the data are ready for the test:
Analyze→ Nonparametric Tests→ Legacy Dialogs→ chi-square (move the categorical variable (Painkiller Brand) into the Test Variable List box), in Expected Values select "Values" and fill its box according to the null hypothesis, in the order of the levels of the categorical variable (in this example: 0.18, 0.3, 0.35, 0.15, 0.02) → Options (Descriptive) → ok.
Output:
Descriptive Statistics | |||||||||||
N | Mean | Std. Deviation | Minimum | Maximum | |||||||
Painkiller Brand | 150 | 2.8467 | 1.21918 | 1.00 | 5.00 | ||||||
Painkiller Brand | |||||||||||
Observed N | Expected N | Residual | |||||||||
Aspirin | 22 | 27.0 | -5.0 | ||||||||
Tylenol | 40 | 45.0 | -5.0 | ||||||||
Advil | 45 | 52.5 | -7.5 | ||||||||
Aleve | 25 | 22.5 | 2.5 | ||||||||
Other | 18 | 3.0 | 15.0 | ||||||||
Total | 150 | ||||||||||
Test Statistics | |||||||||||
Painkiller Brand | |||||||||||
Chi-Square | 77.831a | ||||||||||
df | 4 | ||||||||||
Asymp. Sig. | .000 | ||||||||||
a. 1 cells (20.0%) have expected frequencies less than 5. The minimum expected cell frequency is 3.0. |
The observed and expected frequencies for each level of the categorical variable are shown in the Frequencies table, in which the expected frequency for the last level, "Other", is less than 5. Since only one of the five levels has an expected frequency below 5, we can still rely on the result. The last table shows the chi-square statistic and the P-Value, which is less than 0.05 and therefore statistically significant. That means the null hypothesis is rejected, i.e., the frequencies no longer fit the distribution from the previous survey.
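When the null hypothesis specifies unequal proportions, as in Example 13, the expected counts can be passed explicitly; here is a minimal SciPy sketch.

```python
# Minimal sketch: goodness of fit against hypothesized proportions (Example 13).
from scipy.stats import chisquare

observed = [22, 40, 45, 25, 18]                 # Aspirin, Tylenol, Advil, Aleve, Other
proportions = [0.18, 0.30, 0.35, 0.15, 0.02]    # proportions stated in the null hypothesis
expected = [p * 150 for p in proportions]
stat, p_value = chisquare(observed, f_exp=expected)
print(stat, p_value)                            # approximately 77.8 and a P-Value near 0
```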
We can compare the expected and observed values in a graph. First, we define another variable in the Variable View, "Expected"; to enter its data, simply double-click on the Frequencies table in the output, select the values under "Expected N", then copy and paste them into the Data View window under the Expected variable.
Then run the following procedure:
Graphs→ Legacy Dialogs→ Line (check value of individual cases and choose multiple then define) then check variable under category labels and move categorical variable which is painkiller into the box under Variable. Finally move Expected and Observed Frequency into the Lines Represent box→ Ok
The following graph shows the expected and observed values for the categorical variable; they are not close together, which is consistent with rejecting the null hypothesis.
Chi-Square Test for Independence (Pearson’s Chi-Square)
The chi-square test can also be used to determine whether two categorical variables are independent when one sample is selected from one population. The data are usually displayed in a contingency table, which also displays the levels of each categorical variable. Suppose there are two categorical variables "A" and "B", where "A" has "a" levels and "B" has "b" levels. The chi-square test for independence determines whether the levels of variable A are related to the levels of variable B.
Example 14: Suppose an educator wishes to investigate teaching methodologies. He randomly selects 300 students of comparable academic standing and divides them into three groups, where each group is assigned a different teaching method. He records the numbers of passes and failures on the final exam for each group in the following table. At α = 0.05, is there any association between passing or failing and the type of method?
Pass | Fail | |
Control Group | 20 | 80 |
Old Method | 56 | 44 |
New Method | 70 | 30 |
The hypotheses are:
H0: All outcomes of each method are the same or there is no association between the type of method and the result.
H1: The outcomes of each method are not the same or there is an association between the type of method and the result.
We define our two categorical variables, which are “Method” and “Result”, and assign their levels to the values as the following.
We assign the values of "0" and "1" to "Fail" and "Pass" respectively, and the values of "1", "2", and "3" to "Control Group", "Old Method", and "New Method" respectively. Then we add a "Frequency" variable and, as explained in the goodness-of-fit test, we need to weight the frequencies.
Data→ Weight Cases→ Weight Cases by “Frequency” Variable (move the Frequency into Frequency Variable box)
Then:
Analyze→ Descriptive Statistics→ Crosstabs (move two categorical variables to the row and column respectively (usually the independent variable into the “Row” box and the dependent variable into the “Column” box) and check “Display Clustered bar charts”→ Statistics (check the “Chi-square” box and for example, “Phi and Cramer’s V”) → Continue → Cells (check “Observed” and “Expected”) → Continue → Ok.
Output:
Crosstabs
Type of Method * Result Crosstabulation | |||||
Result | Total | ||||
Fail | Pass | ||||
Type of Method | Control Group | Count | 80 | 20 | 100 |
Expected Count | 51.3 | 48.7 | 100.0 | ||
Old Method | Count | 44 | 56 | 100 | |
Expected Count | 51.3 | 48.7 | 100.0 | ||
New Method | Count | 30 | 70 | 100 | |
Expected Count | 51.3 | 48.7 | 100.0 | ||
Total | Count | 154 | 146 | 300 | |
Expected Count | 154.0 | 146.0 | 300.0 |
Chi-Square Tests | |||
Value | df | Asymptotic Significance (2-sided) | |
Pearson Chi-Square | 53.265a | 2 | .000 |
Likelihood Ratio | 56.236 | 2 | .000 |
Linear-by-Linear Association | 49.869 | 1 | .000 |
N of Valid Cases | 300 | ||
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 48.67. |
The first table indicates the observed and expected frequencies for each cell of the contingency table. The test statistic for this hypothesis test is the chi-square statistic mentioned in the goodness-of-fit test, with (a−1)×(b−1) degrees of freedom, and it is shown in the second table.
Since the P-Value is reported as .000 for this chi-square test, the null hypothesis is rejected, which means there is enough evidence to support the claim that there is an association between the type of method and the results. The bar chart enables us to visualise the contingency table and shows that the number of people who pass the exam is related to the type of method they are assigned to.
Symmetric Measures | |||
Value | Approximate Significance | ||
Nominal by Nominal | Phi | .421 | .000 |
Cramer’s V | .421 | .000 | |
N of Valid Cases | 300 |
The table above indicates the measures of association between the two variables. The values of these measures range from 0 to 1, where 0 means no association and 1 indicates perfect association. In this example, both values are 0.421, which, being greater than 0.3, is considered a strong relationship.
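As a cross-check on the two tables above, the chi-square test of independence and Cramér's V can be computed directly from the contingency table of Example 14. The following is a minimal SciPy sketch; for Cramér's V it uses the standard formula V = √(χ² / (N × (k − 1))), where k is the smaller of the number of rows and columns.

```python
# Minimal sketch: chi-square test of independence and Cramér's V for Example 14.
from math import sqrt
from scipy.stats import chi2_contingency

table = [[20, 80],    # Control Group: pass, fail
         [56, 44],    # Old Method
         [70, 30]]    # New Method

chi2, p_value, dof, expected = chi2_contingency(table)
n = sum(sum(row) for row in table)
cramers_v = sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))
print(chi2, p_value, dof)   # approximately 53.3, p < 0.001, df = 2
print(cramers_v)            # approximately 0.421, as in the Symmetric Measures table
```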
Test for Homogeneity of proportions
This test is used to compare the proportions of two or more populations that share a common characteristic. Thus, we have two or more different populations, from which two or more samples are drawn, and we wish to know whether the proportions of these populations having the characteristic of interest are equal. Suppose we wish to compare the proportion of people who have high blood pressure across three ethnicities: Asian, Caucasian, and Black. We then have three different populations and need to collect a random sample from each of them. The difference between the test for homogeneity of proportions and the test for independence concerns only the sampling design: for the homogeneity test we have two or more different populations, as opposed to the independence test, where we have one population. For example, the researcher may select 50 individuals from each ethnicity to find out whether the proportion with high blood pressure is the same in each group. The hypotheses are:
H0: P1=P2=P3 v.s H1: At least one proportion is different from the others.
The SPSS procedure is the same as for the test of independence, and the assumptions for the three tests above are the same.
Example 15: Suppose a simulation center wishes to determine whether the nursing students' preferences, the medical students' preferences, and the paramedical students' preferences differ across the training techniques (classic treatment, simulation treatment, and combination treatment). They randomly select three samples of 200, 100, and 50 from nursing students, medical students, and paramedical students respectively. According to the samples' preferences, the following contingency table is obtained. At α = 0.05, test the claim that the preferences of nursing students, medical students, and paramedical students are distributed equally across the training techniques.
Nursing | Medical | Paramedical | |
Classic T. | 50 | 20 | 20 |
Simulation T. | 80 | 40 | 10 |
Combination T. | 70 | 40 | 20 |
Total | 200 | 100 | 50 |
The hypotheses:
H0: The proportions of preferences across the fields are equal for each training technique.
H1: At least one proportion is different from the others.
The procedure for SPSS:
We define two categorical (Nominal) variables which are ”Technique” and “Field” and assign the values of “1”, “2”, and “3” for their levels.
Thus, we enter the data into the Data View window according to the variable labels and the contingency table above. Then we weight the frequencies.
Data→ Weight Cases→ Weight Cases by Frequency Variable (move the “Frequency” variable into “Frequency Variable” box).
Finally:
Analyze→ Descriptive Statistics→ Crosstabs (move “Field” into the Column box and “Technique” into the Row box (usually put the variable that indicates the populations into the column) and check “Display Clustered bar charts”) → Statistics (check the Chi-square box) → Cells (check for example Expected and for Residuals Standardized) → Ok.
The following table indicates the observed and expected frequencies for each field within each technique, and the residuals that indicate the differences between them. It also provides the proportion of each field within each technique, which is what we are comparing. Looking at the observed frequencies indicated by "Count", columns that share the same subscript letter are not significantly different. For example, the proportions of nursing students and medical students who prefer the classic training are not significantly different, whereas the proportion of paramedical students who prefer the classic training is statistically different.
The second table indicates the value of the chi-square statistic, which is 10.47, and the P-Value, which is 0.033. Since the P-Value is less than 0.05, we reject the null hypothesis, meaning we have sufficient evidence to reject the claim of equal proportions. In other words, the students' preferences for each technique differ by field of study.
Finally, the bar chart enables us to visualize the groups within each technique, which are not the same.
Output:
Training Techniques * Field of study Crosstabulation | | | | | |
Field of study | Total | | | | |
Nursing | Medical | Paramedical | | | |
Training Techniques | Classic Treatment | Count | 50a | 20a | 20b | 90 |
Expected Count | 51.4 | 25.7 | 12.9 | 90.0 | ||
% within Field of study | 25.0% | 20.0% | 40.0% | 25.7% | |
Residual | -1.4 | -5.7 | 7.1 | |||
Simulation Treatment | Count | 80a | 40a | 10b | 130 | |
Expected Count | 74.3 | 37.1 | 18.6 | 130.0 | ||
% within Field of study | 40.0% | 40.0% | 20.0% | 37.1% | |
Residual | 5.7 | 2.9 | -8.6 | |||
Combination Treatment | Count | 70a | 40a | 20a | 130 | |
Expected Count | 74.3 | 37.1 | 18.6 | 130.0 | ||
% within Field of study | 35.0% | 40.0% | 40.0% | 37.1% | |
Residual | -4.3 | 2.9 | 1.4 | |||
Total | Count | 200 | 100 | 50 | 350 | |
Expected Count | 200.0 | 100.0 | 50.0 | 350.0 | ||
% within Field of study | 100.0% | 100.0% | 100.0% | 100.0% | |
Each subscript letter denotes a subset of Field of study categories whose column proportions do not differ significantly from each other at the .05 level. |
Chi-Square Tests | |||
Value | df | Asymptotic Significance (2-sided) | |
Pearson Chi-Square | 10.470a | 4 | .033 |
Likelihood Ratio | 10.782 | 4 | .029 |
Linear-by-Linear Association | .071 | 1 | .790 |
N of Valid Cases | 350 | ||
a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 12.86. |
The Sign Test (Single Sample Sign Test)
The sign test is used to determine whether the median of a population is equal to a specific value stated in the null hypothesis. A sample of data is randomly selected and each value is compared with the hypothesized median. If a data value is greater than the hypothesized median, it is converted to a plus sign; otherwise it is converted to a minus sign, or to zero (a tie) when the two values are equal. The test value is the smaller of the number of plus signs and the number of minus signs. The sign test is a non-parametric alternative to the one-sample T-test for the mean.
The hypotheses are:
H0: M=M0 v.s H1: M≠M0 (two tailed) or
H0: M=M0 v.s H1: M>M0 (right tailed) or
H0: M=M0 v.s H1: M<M0 (left tailed)
Example 16: Suppose an instructor runs a satisfaction survey with a sample of 15 students who take a new online course. The students measure their satisfaction with the ratings of 1 to 5 where 1 = ”extremely dissatisfied”, 2 = ”dissatisfied”, 3 = ”neutral”, 4 = ”satisfied”, and 5 = ”extremely satisfied”. The following represents the data collected. At α=0.05, are at least half of the students satisfied?
4, 5, 5, 3, 1, 2, 5, 2, 5, 4, 1, 5, 3, 4, 2
In this example, we try to test whether the median of the population is equal to 4 considering that the data are ordinal.
The hypotheses are:
H0: M=4 v.s H1: M≠4
To run the single sample sign test with SPSS:
Analyze→ Nonparametric Tests→ One Sample→ Settings→ check Customize tests→ check "Compare median to hypothesized" and type the hypothesized median, which in this example is 4→ Run
Output:
The table above shows the P-Value of 0.127, which is greater than 0.05; we therefore fail to reject the null hypothesis that the median score is 4. Therefore, at least 50% of the students are satisfied with the course.
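As an illustration, the classical sign test for this example can be sketched in Python with an exact binomial test. Note that SPSS's one-sample nonparametric procedure may base its P-Value on a Wilcoxon signed-rank statistic, so the exact binomial value below need not equal the 0.127 reported above; the sketch is only meant to show the logic of counting signs against the hypothesized median.

```python
# Minimal sketch: single-sample sign test against a hypothesized median of 4 (Example 16).
from scipy.stats import binomtest

ratings = [4, 5, 5, 3, 1, 2, 5, 2, 5, 4, 1, 5, 3, 4, 2]
hypothesized_median = 4

above = sum(r > hypothesized_median for r in ratings)
below = sum(r < hypothesized_median for r in ratings)
n = above + below                     # ties (values equal to 4) are discarded

result = binomtest(min(above, below), n, p=0.5, alternative="two-sided")
print(above, below, result.pvalue)
```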
Non-parametric tests for Paired Sample: The Sign Test and The Wilcoxon Signed Rank Test
As mentioned earlier, if the assumptions for the parametric tests, such as normality or no significant outliers, are not met, or if we have an ordinal dependent variable, we can apply an alternative non-parametric test. The non-parametric alternatives to the T-test for dependent samples are the sign test and the Wilcoxon signed rank test. As with the paired-sample T-test, these non-parametric equivalents also analyse the differences in the dependent variable obtained from each pair. However, these tests consider the median of the differences, as opposed to the paired-sample T-test, which examines the mean of the differences. Thus, we wish to know whether the median of the differences is equal to zero in order to conclude that there is no difference between the two related samples. The Wilcoxon signed rank test is used when the distribution of the differences between the groups is symmetrical in shape, and it usually has more statistical power. Generating a box plot enables us to determine whether the distribution of the differences is symmetrical in shape.
To run the test with SPSS:
Analyze→ Nonparametric Test→ Legacy Dialog→ 2 Related Samples (move two dependent variables to the Test Pairs and choose one of the test types) → Ok
Example 17: Suppose the students from Example 16 are assigned to take the online course a second time with a new presentation. The researcher conducts a second survey at the end of the course. The following are the students' satisfaction ratings for the second course. At α = 0.05, can the researcher conclude that the students' satisfaction differs from the first survey?
4, 5, 4, 4, 3, 1, 4, 3, 5, 5, 3, 4, 2, 3, 1
The hypotheses are:
H0: The students’ satisfaction does not change.
H1: The students’ satisfaction differs (claim).
To run the test with SPSS, first we define the two related variables, "First" and "Second", and enter the values of the dependent variable from each respective survey. Then we create a new variable, denoted D, indicating the differences between the first and second surveys. As mentioned earlier, we use the following procedure to generate "D":
Transform → Compute Variable → define D as the target variable and, in the "Numeric Expression" box, type "First - Second" → Ok
The following procedure generates the box plot of the variable D:
Analyze → Descriptive Statistics → Explore (move variable “D” into the “Dependent List” box) → Statistics (check “Outliers”) → Continue → Plots (check “Histogram”) → Continue → ok
Output:
The box plot above shows that the distribution of the differences is not symmetrical: the area above the median is not equal to the area below the median, since there is no upper whisker. The histogram above also confirms the skewness, which is negative. In this case, we choose the sign test.
Sign Test
Frequencies | ||
N | ||
Second Survey – First Survey | Negative Differencesa | 7 |
Positive Differencesb | 5 | |
Tiesc | 3 | |
Total | 15 | |
a. Second Survey < First Survey | ||
b. Second Survey > First Survey | ||
c. Second Survey = First Survey |
Test Statisticsa | |
Second Survey – First Survey | |
Exact Sig. (2-tailed) | .774b |
a. Sign Test | |
b. Binomial distribution used. |
The first table reports the numbers of negative differences, positive differences, and ties between the students' scores on the two surveys. The second table indicates the P-Value, which is not significant, meaning the median of the differences between the first and second surveys is equal to zero. Therefore, the researcher fails to reject the null hypothesis and does not have enough evidence to support the claim. Thus, the new presentation does not appear to affect the students' satisfaction.
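The paired sign test and the Wilcoxon signed rank test for these two surveys can also be sketched in Python; the counts of positive and negative differences and the exact sign-test P-Value below should match the SPSS tables above.

```python
# Minimal sketch: paired sign test and Wilcoxon signed-rank test for Examples 16-17.
from scipy.stats import binomtest, wilcoxon

first  = [4, 5, 5, 3, 1, 2, 5, 2, 5, 4, 1, 5, 3, 4, 2]
second = [4, 5, 4, 4, 3, 1, 4, 3, 5, 5, 3, 4, 2, 3, 1]

diffs = [s - f for f, s in zip(first, second)]
pos = sum(d > 0 for d in diffs)
neg = sum(d < 0 for d in diffs)                  # ties (d == 0) are dropped

# Sign test: exact binomial test on the smaller sign count.
sign_p = binomtest(min(pos, neg), pos + neg, p=0.5).pvalue
print(pos, neg, sign_p)                          # 5, 7, and about 0.774

# Wilcoxon signed-rank test (preferred when the differences look symmetric).
stat, wilcoxon_p = wilcoxon(first, second)
print(stat, wilcoxon_p)
```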
Nonparametric Test for Independent Samples: The Wilcoxon Rank Sum Test / Mann Whitney U Test
The Wilcoxon rank sum test, or Mann Whitney U test, is used to compare a dependent variable between two independent groups. It is the non-parametric equivalent of the two independent samples T-test when the assumptions (normality, no outliers) are not met. Therefore, there is a categorical variable with two levels, where each level represents a population, and one quantitative or ordinal variable that is compared between the two independent samples drawn from the respective populations. The individuals or observations in each group should be randomly selected and have no relationship with one another. The Mann Whitney U test compares the medians of the populations if the two distributions have similar shapes; otherwise, the mean ranks are compared. To check whether the two populations have the same shape, the following procedure is used to generate the corresponding histograms and boxplots. (Entering the data is the same as for the two independent samples T-test.)
Analyze → Descriptive Statistics → Explore (move the dependent and categorical variables into the appropriate boxes) → Statistics (check “Outliers”) → Continue → Plots (check “Histogram”) → Continue → ok
Example 18: Suppose a researcher wishes to determine whether injections are less painful if the patient looks away. He randomly selects 40 patients who are assigned to receive the same injection and divides them into two groups. The first group is asked to look directly at the injection site, and the second group is asked to look away at the time of injection. After the injection, each patient is asked to rate their pain on a scale of 1 to 4, where 1 = "neutral", 2 = "slightly painful", 3 = "painful", and 4 = "extremely painful". At α = 0.05, can the researcher conclude that the injection is less painful when looking away? The following are the data for each group:
The first group: 2, 3, 3, 2, 1, 2, 2, 2, 2, 3, 3, 4, 2, 2, 4, 3, 2, 2, 1, 1
The second group: 2, 2, 1, 1, 3, 1, 4, 3, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2
The hypotheses:
H0: There is no difference between the distributions of the two populations
H1: There is a difference between the distributions of the two populations, i.e., the injection is less painful when the patient looks away (claim).
To run the test with SPSS first we define dependent variable “Score” and the categorical or group variable which is in this example “Group”. Then we assign the values of 1 and 2 for “The first group (look)” and “The second group (not look)” which are the levels of the “Group”. Finally, we enter the data as we already entered in the two independent samples T-Test.
As mentioned earlier, first we check the shape of two population distributions with the procedure mentioned above.
Output:
The histograms and box plots above represent the shape of each respective group. The second group is more skewed to the right than the first group, which is also slightly right skewed. The box plots show that the medians of both groups are equal to 2. However, in the first group the median is equal to the first quartile and close to the lower whisker, whereas in the second group the median is equal to the third quartile. In this example, the distributions of the two groups have approximately the same shape.
Analyze→ Nonparametric Test→ Legacy Dialog→ 2 Independent Samples (move the dependent variable “Score” into the “Test Variable List” box and the group variable “Group” into the “Grouping Variable” box) → Define Groups (enter the values that is selected for the levels (in this example 1 and 2)) → Continue → select test type (Mann-Whitney U) → Ok
Ranks | ||||
Group | N | Mean Rank | Sum of Ranks | |
Pain Score | The first group (look) | 20 | 23.15 | 463.00 |
The second group (not look) | 20 | 17.85 | 357.00 | |
Total | 40 |
Test Statisticsa | |
Pain Score | |
Mann-Whitney U | 147.000 |
Wilcoxon W | 357.000 |
Z | -1.566 |
Asymp. Sig. (2-tailed) | .117 |
Exact Sig. [2*(1-tailed Sig.)] | .157b |
a. Grouping Variable: Group | |
b. Not corrected for ties. |
The first table indicates the mean ranks of the groups, and the second table indicates the result of the non-parametric test (Mann-Whitney) for two independent groups. The P-Value is 0.117, which is greater than 0.05 and therefore not statistically significant. In this example, the researcher does not have enough evidence to support the claim and consequently fails to reject the null hypothesis regarding a difference between the medians. Therefore, based on these data, looking away at the time of injection does not decrease the pain.
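The same comparison can be sketched with SciPy's Mann-Whitney U function; note that SciPy reports the U statistic for the first group, which may correspond to n1 × n2 minus the (smaller) U value shown by SPSS, while the two-sided P-Value is comparable.

```python
# Minimal sketch: Mann-Whitney U test for Example 18.
from scipy.stats import mannwhitneyu

look      = [2, 3, 3, 2, 1, 2, 2, 2, 2, 3, 3, 4, 2, 2, 4, 3, 2, 2, 1, 1]
look_away = [2, 2, 1, 1, 3, 1, 4, 3, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2]

u_stat, p_value = mannwhitneyu(look, look_away, alternative="two-sided")
print(u_stat, p_value)   # the P-Value should be comparable to the SPSS value of 0.117
```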
Nonparametric test with more than two independent samples (The Kruskal-Wallis Test)
The Kruskal-Wallis test is a non-parametric alternative to the one-way ANOVA test. It is used to compare more than two independent samples, selected from more than two corresponding populations, in terms of one quantitative or ordinal dependent variable. If the assumptions for the ANOVA test mentioned previously, such as normality, homogeneity of variances, and no outliers, are not met, the equivalent nonparametric test, the Kruskal-Wallis test (also called the H test or one-way ANOVA on ranks), can be used. However, the Kruskal-Wallis test needs at least 5 individuals or observations in each group, and the observations must be independent of one another. Before running the Kruskal-Wallis test, we check the distributions of the dependent variable across the levels of the independent (group) variable, which need to have the same shape in order to interpret the medians; otherwise, the mean ranks of the groups are interpreted. As explained for the Mann Whitney U test, we can check the distributions of the groups with histograms and boxplots.
Example 19: Suppose that, in addition to Example 18, the researcher randomly selects 20 more patients to receive the same injection. This group is distracted at the time of injection and labeled "The third group (distraction)". At α = 0.05, can the researcher conclude that the pain scores are the same across the groups?
The hypotheses are:
H0: There is no difference between the pain scores across the groups
v.s
H1: At least one group has a different result from the others
The data for the third group:
1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 3
To run the test with SPSS, first we define the dependent variable "Score" and the categorical or group variable, which in this example is "Group". Then we assign the values 1, 2, and 3 to "The first group (look)", "The second group (not look)", and "The third group (distracted)" respectively. The same procedure (Explore) used in Example 18 is applied to generate the following histograms and box plots.
Output:
The third group is positively skewed as well as the others. Therefore, we can compare the medians of the groups, since the distributions of the groups have approximately the same shape. The following procedure is used to run the Kruskal-Wallis test.
Analyze→ Nonparametric Tests→ Legacy Dialogs→ K Independent Samples (move the dependent variable into the “Test Variable List” box and the group variable into the “Grouping variable” box) → Define Range (specify minimum and maximum of the values for the groups) → Option (check Descriptive) → Test type (check Kruskal-Wallis H) → Ok
Output:
Ranks | |||
Group | N | Mean Rank | |
Pain Score | The First Group (look) | 20 | 37.00 |
The Second Group (not look) | 20 | 29.10 | |
The Third Group (distracted) | 20 | 25.40 | |
Total | 60 |
Test Statisticsa,b | |
Pain Score | |
Chi-Square | 5.507 |
df | 2 |
Asymp. Sig. | .064 |
a. Kruskal Wallis Test | |
b. Grouping Variable: Group |
The first table represents the mean ranks of the groups, and the second table represents the test value (5.507), the degrees of freedom (2), and, more importantly, the P-Value (0.064), which is not statistically significant. That means there is no evidence of a difference between the distributions of pain scores across the groups, so the medians are considered equal. Since the null hypothesis is not rejected in this example, a post hoc test is not needed to determine where the differences are. If we did need to apply a post hoc test to locate any differences, the following approaches are recommended:
- Apply the Mann Whitney U test to each pair of groups and, to prevent inflation of the type I error, use the Bonferroni method to calculate a new significance level. The following formula is used to calculate the Bonferroni-adjusted significance level αB, where "c" is the number of comparisons that will be conducted (a sketch of this approach follows the list below).
αB = α / c
- Analyze → Nonparametric Tests → Independent Samples → Fields (move the dependent variable into the “Test Fields” box and the independent variable into the “Groups”) → Setting (select “Customize test” and check “Kruskal-Wallis 1-way ANOVA k samples”) → Run
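As a sketch of the first approach above, the Kruskal-Wallis test and Bonferroni-corrected pairwise Mann Whitney U tests for Example 19 can be run in Python; the group data are those listed in Examples 18 and 19.

```python
# Minimal sketch: Kruskal-Wallis test for Example 19 with Bonferroni-corrected
# pairwise Mann-Whitney follow-up comparisons.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

groups = {
    "look":       [2, 3, 3, 2, 1, 2, 2, 2, 2, 3, 3, 4, 2, 2, 4, 3, 2, 2, 1, 1],
    "not look":   [2, 2, 1, 1, 3, 1, 4, 3, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2],
    "distracted": [1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 3, 2, 2, 3],
}

h_stat, p_value = kruskal(*groups.values())
print(h_stat, p_value)          # should be close to the SPSS values (5.507, 0.064)

# Pairwise comparisons judged against the Bonferroni-adjusted level alpha_B = alpha / c.
alpha, c = 0.05, 3
for (name1, x), (name2, y) in combinations(groups.items(), 2):
    pair_p = mannwhitneyu(x, y, alternative="two-sided").pvalue
    print(name1, "vs", name2, pair_p, "significant" if pair_p < alpha / c else "not significant")
```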
The second procedure generates the following output. Double-clicking on the table opens a new "Model Viewer" window that has two parts: the left part includes the following table, and the right part shows the boxplot and the summary of the test. If the null hypothesis is rejected, then at the bottom right of the window, under "View", select "Pairwise Comparisons" to generate the table that compares each pair of groups.
Output:
The Spearman’s Rank-Order Correlation Coefficient
The nonparametric equivalent of the Pearson correlation coefficient test is the Spearman rank correlation coefficient test, denoted by "rs" for samples and "ρ" for the population, which can be used when the two variables are quantitative or ordinal. The Spearman correlation coefficient determines the strength and direction of a monotonic relationship between two variables, while the Pearson correlation coefficient assesses the strength and direction of a linear relationship. The normality assumption is not required to run the test, as opposed to the Pearson correlation coefficient. However, the data should be monotonically related (linearity is not required), and they can be continuous or discrete, including ordinal.
Example 20: Suppose in Example 16, the instructor wishes to know whether there is a relationship between the students’ ratings and their final scores. At α= 0.05 conduct the Spearman Rank test to determine the strength and direction of any association. The following represents the data from the sample of 15 students.
The ratings: 4, 5, 5, 3, 1, 2, 5, 2, 5, 4, 1, 5, 3, 4, 2
The scores: 87, 85, 78, 70, 75, 65, 80, 70, 81, 78, 60, 78, 68, 72, 71
The hypotheses are:
H0: ρ=0 v.s H1: ρ≠0
Define two variables, "Rating" and "Score", which are ordinal and scale variables respectively, and enter the data above in the "Data View" window. As in Example 16, for the "Rating" variable we label the values 1 to 5 as extremely dissatisfied, dissatisfied, neutral, satisfied, and extremely satisfied respectively.
To run the test with SPSS, first we check for a monotonic relationship with a scatter plot:
Graphs→ Chart Builder→ Choose from Scatter/Dot (drag the first square to the top right and choose the X-Axis and Y-Axis with the variables (usually independent variable into X-Axis box and dependent variable into Y-Axis)) → Ok
If there are more than two variables the following procedure is more appropriate:
Graphs→ Legacy Dialogs→ Scatter/Dot (select Matrix Scatter) → Define (move all the variables except nominal one into the Matrix Variables box) → Ok
Output:
As the two scatter plots above demonstrate, the two variables have a monotonic, and even approximately linear, relationship: as the scores increase, the ratings increase. Therefore, we run the Spearman rank test with the following procedure to determine the strength of the relationship.
Analyze→ Correlate→ Bivariate (move the variables into the right box and check the spearman box) → Ok
Output:
Correlations | ||||
The Rating of Satisfaction | The Final Score | |
Spearman’s rho | The Rating of Satisfaction | Correlation Coefficient | 1.000 | .764**
Sig. (2-tailed) | . | .001 | ||
N | 15 | 15 | ||
The Final Score | Correlation Coefficient | .764** | 1.000 | |
Sig. (2-tailed) | .001 | . | ||
N | 15 | 15 | ||
**. Correlation is significant at the 0.01 level (2-tailed). |
The table above indicates the Spearman correlation coefficient (rs = 0.764), which is strong, and the P-Value (0.001), which is less than 0.05. Thus, the null hypothesis is rejected, which means there is a significant, strong, positive correlation between the two variables.
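The Spearman coefficient and its P-Value can be reproduced with a short SciPy sketch using the ratings and scores listed in Example 20.

```python
# Minimal sketch: Spearman rank-order correlation for Example 20.
from scipy.stats import spearmanr

ratings = [4, 5, 5, 3, 1, 2, 5, 2, 5, 4, 1, 5, 3, 4, 2]
scores  = [87, 85, 78, 70, 75, 65, 80, 70, 81, 78, 60, 78, 68, 72, 71]

rho, p_value = spearmanr(ratings, scores)
print(rho, p_value)   # approximately 0.764 and 0.001, as in the SPSS table
```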
The Runs Test
The Runs test is a nonparametric test that is used to determine the randomness of the order of observations based on the concept of a run. A run is a sequence of similar observations in terms of their values or codes from a sample of different observations. For example, consider the following data, where 1 indicates male and 2 indicates female.
1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 1
(1, 1, 1), (2, 2), (1), (2, 2), (1), (2), (1)
Each parenthesis represents a run that includes identical codes of 1 or 2. Therefore, in this sequence of observations, we have 7 runs. If the variable is continuous, the values of the observations are compared to a cut point such as the mean or the median.
Example 21: Suppose a researcher selects a sample of 20 students and records their fields. Consider the following data where the values of 1 to 4 are assigned to “medicine”, “nursing”, “dentistry”, and “paramedical” fields respectively. Test for randomness of the order of observations at α= 0.05.
1,1,2,2,1,3,4,2,1,4,3,1,2,4,3,1,2,4,4,4
Define one nominal variable, "Field", with 4 levels coded with the values 1 to 4 as explained above.
The hypotheses are:
H0: The students are selected randomly according to their fields (claim)
v.s
H1: The sample is not selected randomly.
The following is the procedure of the Runs test:
Analyze→ Nonparametric Test→ Legacy Dialogs→ Runs (move the variable into the “Test Variable List” and check Median or Mean or Mode or Custom to indicate the cut point (for example the mean of codes)).
Output:
Runs Test 3 | |
Field of Study | |
Test Valuea | 2.5000 |
Total Cases | 20 |
Number of Runs | 8 |
Z | -1.114 |
Asymp. Sig. (2-tailed) | .265 |
a. User-specified. |
The table above is based on the cut point of 2.5 and indicates the P-Value (0.265), which is not statistically significant. That means the null hypothesis is not rejected, i.e., there is not enough evidence to reject the claim. Thus, the students are selected randomly with respect to their fields.
If we have a sequence of numerical data and wish to know whether it was generated at random, we check Median; if we want to compare the data with a specific number, we check Custom and type the number into the box.
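Because SciPy does not ship a runs test, the following sketch implements the usual normal approximation with a 0.5 continuity correction (the formulas for the expected number of runs and its variance are the standard Wald-Wolfowitz ones) and applies it to the field-of-study codes above with the cut point 2.5.

```python
# Minimal sketch: runs test via the normal approximation with continuity correction.
from math import sqrt
from scipy.stats import norm

codes = [1, 1, 2, 2, 1, 3, 4, 2, 1, 4, 3, 1, 2, 4, 3, 1, 2, 4, 4, 4]
cut = 2.5

signs = [x > cut for x in codes]                       # dichotomize around the cut point
runs = 1 + sum(signs[i] != signs[i - 1] for i in range(1, len(signs)))
n1, n2 = signs.count(True), signs.count(False)
n = n1 + n2

expected = 2 * n1 * n2 / n + 1
variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
z = (abs(runs - expected) - 0.5) / sqrt(variance)      # 0.5 continuity correction
p_value = 2 * norm.sf(z)
print(runs, round(z, 3), round(p_value, 3))            # 8 runs, |Z| about 1.11, p about 0.265
```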
Correlation and Regression
The regression analysis technique is used to determine the relationship between two or more variables, identify a model for the relationship, and make predictions.
Linear Regression
Simple linear regression is used when the relationship between two numerical variables is linear, and the correlation indicates whether a linear relationship exists and how strong it is.
Simple regression involves two variables: one independent variable, sometimes called the explanatory or predictor variable and denoted by "x", and one dependent variable, sometimes called the response or predicted variable and denoted by "y". If a relationship between these variables exists, it may be positive or negative. A positive relationship occurs when the values of the two variables increase or decrease together, whereas a negative relationship occurs when the value of one variable increases while the other decreases, and vice versa. The scatter plot is the most important plot for detecting this kind of relationship. The independent variable is plotted on the x-axis and the dependent variable on the y-axis.
To obtain the scatter plot with SPSS the following procedure is used:
Graphs→ Legacy Dialogs→ Scatter/Dot (choose simple Scatter) → Define (move the independent variable into the X-Axis box and the dependent variable into the Y-Axis box)→ Ok.
The following plots illustrate some possible relationships between two variables.
[Scatter plot panels a–d (top row) and e–h (bottom row)]
As the plots above show, there is a positive relationship in the first row, with the strength increasing from left to right. In the second row the relationship is negative, again with the strength increasing from left to right. Plots "a" and "e" show a weak relationship, while plots "d" and "h" show a very strong linear relationship. The following plot shows a strong relationship that is not linear, called a curvilinear relationship.
Correlation coefficient
The correlation coefficient measures the strength of the relationship between two numerical variables. The most common correlation coefficient is the Pearson product-moment correlation coefficient, which measures the strength and direction of the linear relationship between two variables and is denoted by "r" for a sample and by "ρ" (or R) for the population. The correlation coefficient ranges from -1 to +1. The sign indicates the direction, and the absolute value indicates the strength, of the relationship between the two variables. Thus, R = -1 indicates a perfect negative linear relationship and R = +1 a perfect positive linear relationship between the two variables. The closer the correlation coefficient is to -1 or +1, the stronger the relationship between the variables, and vice versa. If there is no linear relationship between the two variables, R is equal to zero. However, if R is close to zero we cannot say there is no relationship at all; there might be an association between the two variables that is not linear. The correlation coefficient is unitless and sensitive to outliers. For example, the following are the correlation coefficients for the plots above.
Plot a: r=0.33 Plot b: r=0.69 Plot c: r=0.98 Plot d: r=1
Plot e: r=-0.08 Plot f: r=-0.64 Plot g: r=-0.92 Plot h: r=-1
SPSS procedure:
Analyze→ Correlate→ Bivariate→ Ok
Example 21 (from Elementary Statistics: A Step By Step Approach, 8th Edition, page 537):
Construct a scatter plot for the data obtained in a study on the number of absences and the final grades of seven randomly selected students from a statistics class. And compute the value of the correlation coefficient.
Student Number of absences x Final grade y (%)
A 6 82
B 2 86
C 15 43
D 9 74
E 12 58
F 5 90
G 8 78
The explanatory or independent variable (x) is the number of absences and the response or dependent (y) variable is the final grade.
For the scatter plot the following procedure is used:
Graphs→ Legacy Dialogs→ Scatter/Dot (choose simple Scatter) → Define (move Number of absences into the X-Axis box and Final grade into the Y-Axis box) → Ok.
As we expected there is a strong negative linear relationship between the two variables.
The following procedure gives the value of correlation coefficient.
Analyze→ Correlate → Bivariate (move Number of absences and Final grade into the Variables box) → Pearson→ Ok
Output:
Correlations | |||
Number of absences (x) | Final grade(y) | ||
Number of absences (x) | Pearson Correlation | 1 | -.944** |
Sig. (2-tailed) | .001 | ||
N | 7 | 7 | |
Final grade(y) | Pearson Correlation | -.944** | 1 |
Sig. (2-tailed) | .001 | ||
N | 7 | 7 | |
**. Correlation is significant at the 0.01 level (2-tailed). |
The table above shows that the Pearson correlation is -0.944, indicating a strong negative linear relationship between the two variables, as the scatter plot suggested earlier.
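The same value can be verified with a one-line SciPy call on the data listed above.

```python
# Minimal sketch: Pearson correlation for the absences/final-grade data above.
from scipy.stats import pearsonr

absences = [6, 2, 15, 9, 12, 5, 8]
grades   = [82, 86, 43, 74, 58, 90, 78]

r, p_value = pearsonr(absences, grades)
print(r, p_value)   # approximately -0.944 and 0.001, matching the SPSS table
```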
The existence of the linear relationship between two variables for the population can be tested by the following hypotheses:
H0: ρ=0 v.s H1: ρ≠0
Here ρ = 0 indicates that there is no linear relationship between the two variables in the population.
First we need to select a random sample of pairs (x, y) from the population, where x is a value of the independent (explanatory) variable X and y is a value of the response (dependent) variable Y. That means for each x there must be one y, and vice versa, so we have paired data.
Assumptions:
- The (x,y) pairs are a random sample from a bivariate normal population.
- The (x,y) pairs are independent.
The T test is used to test the hypothesis and the test statistic is:
t = r × √(n − 2) / √(1 − r²), with n − 2 degrees of freedom, where r is the sample correlation coefficient and n is the sample size.
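As a worked illustration with the absences/final-grade data above (r = −0.944, n = 7): t = −0.944 × √5 / √(1 − 0.944²) ≈ −6.4 with 5 degrees of freedom, which is consistent with the two-tailed Sig. value of .001 reported by SPSS.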
The SPSS procedure is the same as above:
Analyze→ Correlate → Bivariate (move Number of absences and Final grade into the Variables box) → Pearson→ Ok
We should always be cautious about misinterpreting the correlation coefficient. The correlation coefficient only detects a linear relationship, and a high value does not always imply a genuine linear relationship; it may be due to a third variable that is related to both variables. Thus, other variables must be investigated. Correlation does not imply causation.
The next step, after establishing a linear relationship between the two variables with the scatter plot and the correlation coefficient test, is to fit a best line to the scatter plot and use its equation to predict values of the response variable. The simple linear regression model has the following equation:
yi=β0+β1xi
where β0 and β1 are the linear model parameters and can be estimated by the sample (β0 is the intercept and β1 is the slope of the linear equation). The point estimates for those parameters are denoted by b0 and b1 respectively. Then the equation for the best fitting line would be:
ŷi = b0 + b1 xi
where, for the ith observation (xi, yi), yi is the observed response, xi is the observed value of the explanatory variable, and ŷi is the predicted response. Since not all the observations fall on the line, we look for the best line that can be fitted to the observations and captures most of them. The slope of the equation (b1) indicates the direction or trend, whether downward or upward; thus its sign is the same as the sign of the correlation coefficient.
The best line is called the least squares line, which means it minimizes the sum of the squared residuals.
The slope and intercept of the least squares line (b1 and b0) can be computed by the following formulas:
b1 = r × (sy / sx) ,  b0 = ȳ − b1 x̄
Where: sy : the sample standard deviation of the response variable
sx : the sample standard deviation of the explanatory variable
r : the correlation coefficient between the two variables
ȳ : the sample mean of the response variable
x̄ : the sample mean of the explanatory variable
Interpretation for the slope (b1): the regression model predicts that for each one-unit increase in the explanatory variable "x", the response variable "y" increases or decreases on average by |b1| units.
Interpretation for the intercept (b0): the regression model predicts the response variable is equal to the intercept when the explanatory variable is equal to 0.
Assumptions for the simple linear regression model:
- Linearity
- Nearly normal residuals
- Constant variability or homoscedastic
- Independent observations
Residuals
Consider the ith observation (xi, yi). By plugging xi into the regression model (ŷi = b0 + b1 xi), the predicted response (ŷi) is obtained; the residual for this observation is therefore the difference between the observed response value and the predicted response value, i.e., the vertical distance from the observation to the line:
ei = yi − ŷi
Observations above the regression line have positive residuals, while observations below the regression line have negative residuals. The ideal line is the one that generates the smallest residuals. A residual plot is also helpful, and sometimes makes any curvature or pattern more obvious, when we assess the linear relationship between the two variables. When there is a strong linear relationship between the variables, the residuals are scattered around the horizontal line representing 0 (since the residuals sum to zero) without any pattern or curvature. We usually plot the residuals against the explanatory variable "x" or the predicted response "ŷi". The assumptions for the residuals are essentially the same as for the simple linear regression model, and we can largely check them with the residual scatter plot. The following plots show some possible patterns and the problems they indicate.
The following plots show no pattern or curvature in the residuals, and the variability is constant. Thus, it is appropriate to fit the linear model to the data, even though the plot on the right indicates a weak linear relationship. The next plot shows the variance of the residuals increasing as x increases; the dataset therefore exhibits increasing variability, and the linear model is not appropriate.
The following plot shows curvature; thus, the linear model is not appropriate.
The following plot shows outliers, one of which is very far from the line; if we want to fit the linear model, they should be investigated.
Finally, the following plot indicates a time effect that has not been properly accounted for in the model.
The following procedures generate the scatter plot for the residuals:
- Analyze→ Regression→ Linear (move dependent and independent variables into their boxes) →
Plots (move “*ZPRED” into the X box and “*ZRESID” into the Y box) and check normal probability plot → Continue → Ok.
- If we wish to plot the residuals against the explanatory variable first we should save the residuals by the following procedure:
Analyze → Regression → Linear (move the variables into the right spot) → Save (check Unstandardized Residuals) → Continue → Ok.
Then
Graphs → Legacy Dialogs → Scatter/Dot → Simple scatter → Define (move the independent variable “x” into the X Axis and the Unstandardized Residuals into the Y Axis) → Ok.
To draw a horizontal line that indicates “0” double click on the plot:
Options→ Y Axis Reference Line → Position (type “0” into the box) → Apply → Close.
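If SPSS is not at hand, an equivalent residual plot can be produced in Python with matplotlib; this is only a sketch on hypothetical x and y values, not the building data of the next example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical illustrative data (not the textbook's building dataset)
x = np.array([38.0, 40.0, 45.0, 50.0, 54.0, 60.0])
y = np.array([520.0, 555.0, 590.0, 640.0, 665.0, 720.0])

# Fit the least squares line and compute residuals e_i = y_i - y_hat_i
b1, b0 = np.polyfit(x, y, deg=1)      # slope and intercept of the least squares line
residuals = y - (b0 + b1 * x)

# Residuals against the explanatory variable, with a reference line at 0
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")        # horizontal zero line, as in the SPSS plot
plt.xlabel("x (explanatory variable)")
plt.ylabel("residual")
plt.title("Residual plot")
plt.show()
```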
Example 22: Stories and heights of buildings data follow:
We generate the scatter plot using the first procedure above for the residuals when the explanatory variable is Stories and the response variable is Heights.
The plot indicates no pattern or curvature. The variability is almost constant, even though the first observation is farther from the zero line than the rest. The normal P-P plot indicates that the residuals are approximately normally distributed.
The following plot was generated by the second procedure above.
For the dataset above, we run the following procedure to obtain the point estimates for the parameters, write the equation for the least squares line, and test the hypotheses for the slope, which are:
H0: β1 = 0 vs. H1: β1 ≠ 0
The null hypothesis indicates that there is no linear relationship between the variables.
Analyze → Regression → Linear (move the variables into the right box) → Statistics (you can check Descriptives) → Continue → Ok.
Output:
Descriptive Statistics | |||
Mean | Std. Deviation | N | |
Heights(y) | 606.5000 | 109.07923 | 10 |
Stories(x) | 43.2000 | 9.39030 | 10 |
Correlations | |||
Heights(y) | Stories(x) | ||
Pearson Correlation | Heights(y) | 1.000 | .797 |
Stories(x) | .797 | 1.000 | |
Sig. (1-tailed) | Heights(y) | . | .003 |
Stories(x) | .003 | . | |
N | Heights(y) | 10 | 10 |
Stories(x) | 10 | 10 |
The table above indicates the correlation coefficient between the two variables is 0.797 which is strong and positive.
ANOVAa | ||||||
Model | Sum of Squares | df | Mean Square | F | Sig. | |
1 | Regression | 68072.707 | 1 | 68072.707 | 13.959 | .006b |
Residual | 39011.793 | 8 | 4876.474 | |||
Total | 107084.500 | 9 | ||||
a. Dependent Variable: Heights(y) | ||||||
b. Predictors: (Constant), Stories(x) |
The table above is the ANOVA table, which shows that the P-value for the slope hypothesis is 0.006; the null hypothesis is therefore rejected. Thus, there is a linear relationship between the two variables.
Coefficientsa | ||||||
Model | Unstandardized Coefficients | Standardized Coefficients | t | Sig. | ||
B | Std. Error | Beta | ||||
1 | (Constant) | 206.399 | 109.340 | 1.888 | .096 | |
Stories(x) | 9.262 | 2.479 | .797 | 3.736 | .006 | |
a. Dependent Variable: Heights(y) |
Finally, the table above gives b0 = 206.399 and b1 = 9.262, so we can write the equation:
ŷi = 206.399 + 9.262 xi
Thus, with this equation we can predict the height of a building from its number of stories. However, we should be cautious about extrapolation, that is, applying the model to predict values beyond the original range of the observations.
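As a cross-check on this kind of SPSS output, the same least squares fit can be obtained with statsmodels in Python. The sketch below assumes statsmodels and pandas are installed, and the stories/heights numbers are hypothetical placeholders rather than the original ten buildings, so the coefficients will differ from those above.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical stories/heights data (placeholder values, not the original dataset)
df = pd.DataFrame({
    "stories": [30, 35, 40, 42, 45, 48, 50, 52, 55, 60],
    "height":  [450, 500, 560, 575, 610, 640, 660, 680, 720, 770],
})

# Ordinary least squares: height regressed on stories (with an intercept)
X = sm.add_constant(df["stories"])
model = sm.OLS(df["height"], X).fit()

print(model.params)     # b0 (const) and b1 (stories)
print(model.pvalues)    # P-value testing H0: beta1 = 0
print(model.rsquared)   # coefficient of determination R^2

# Predict the height of a 47-storey building (within the observed range,
# so this is interpolation rather than extrapolation)
print(model.predict([[1, 47]]))
```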
Outliers:
In regression, we consider an observation an outlier when it falls far from the cloud of points in a scatter plot or residual plot. We are interested in whether and how an outlier influences the least squares line, since an outlier can pull the line up or down and change the slope; such a point is called an influential outlier. If an outlier lies close to the least squares line, so that it does not influence the line and its slope, or influences them only slightly, it is not considered influential. An outlier with an extreme x value compared to the other points, which therefore falls horizontally away from the cloud, has more influence on the line and its slope and is said to have high leverage. However, we cannot ignore or delete outliers without a proper justification.
Coefficient of determination:
After fitting the least squares line, which is the best fit to the dataset, and computing its equation, we can use it to predict the value of the dependent variable from the value of the independent variable. The question, however, is how close the prediction is to the real value. This can be answered with the coefficient of determination, denoted R² (for the sample we use r²) or R-squared, which is the correlation coefficient squared. R-squared measures the proportion of the variability of the dependent (response) variable that is explained by the independent (explanatory) variable. In other words, R² indicates how well the least squares line explains the dependent variable.
Since R² is a square, its range is between 0 and 1, so its sign is always positive, and it is usually expressed as a percentage. For example, R² = 90% indicates that 90% of the variability of the response variable is explained by the independent variable, which is very high, so the model is almost perfect. Hence, we wish R-squared to be close to 100%. On the other hand, an R-squared of 0 indicates that we cannot use the model to predict the dependent variable.
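A tiny illustration of the relationship between r and R², using made-up numbers:

```python
import numpy as np

# Made-up paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
r_squared = r ** 2            # coefficient of determination

# r_squared is the proportion of the variability of y explained by x
print(f"r = {r:.3f}, R^2 = {r_squared:.3f} ({100 * r_squared:.1f}% explained)")
```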
Regression with categorical independent variable
Previously we explained simple linear regression with continuous dependent and independent variables. However, it is also possible to have a categorical independent variable. In that case we need to assign a value to each level of the variable in order to transform the categorical variable into a numerical one. We code the levels of the explanatory variable as 0 (the reference level) and 1; hence this is called regression with a 0/1 variable. For example, suppose we wish to predict the frequency of seeing a doctor based on smoking habit (non-smoker or smoker). In this example the explanatory variable is categorical with two levels, non-smoker and smoker. Hence, we use a variable called Smoking (an indicator or dummy variable) that takes the value 0 if the person is a non-smoker and 1 if the person is a smoker. If we call the response variable Frequency and the explanatory variable Smoking, the equation for the model is:
Frequency=b0+b1*smoking
For non-smokers the value of x is 0, and plugging 0 into the equation gives:
Frequency=b0+b1*0=b0
That means the predicted frequency of seeing a doctor for non-smokers is equal to the intercept. Hence, we can interpret the intercept of the linear regression model, b0, as the predicted value of the response variable when the explanatory variable is at its reference level (0).
If we plug 1 into the model for the other level of the explanatory variable, in this example smokers, that gives:
Frequency=b0+b1*1 =b0+b1
Hence, we can interpret the slope b1 as the difference between the two predicted values of the response variable when the explanatory variable takes the values 1 and 0.
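A brief sketch of such a 0/1 (dummy-variable) regression in Python; the Frequency and Smoking values are invented purely to illustrate the coding, and statsmodels is assumed to be available.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: doctor-visit frequency and a 0/1 smoking indicator
# (0 = non-smoker, the reference level; 1 = smoker)
df = pd.DataFrame({
    "frequency": [2, 1, 3, 2, 4, 5, 6, 4, 5, 7],
    "smoking":   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Frequency = b0 + b1 * smoking
fit = smf.ols("frequency ~ smoking", data=df).fit()
b0, b1 = fit.params["Intercept"], fit.params["smoking"]

print(f"b0 (predicted frequency for non-smokers)         : {b0:.2f}")
print(f"b1 (difference, smoker minus non-smoker, on avg.) : {b1:.2f}")
```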
Example 23: https://onlinecourses.science.psu.edu/stat501/node/380
Is a baby’s birth weight related to the mother’s smoking during pregnancy?
Researchers (Daniel, 1999) interested in answering the above research question collected the following data (birthsmokers.txt) on a random sample of n = 32 births:
In the dataset above we have Weight (continuous variable) that is birthweight of baby in grams, Gest (continuous variable) that is length of gestation in weeks and Smoking (categorical variable) that is smoking status of mothers (yes or no).
First, we consider Weight as the response (y) variable and Smoking as the explanatory (x) variable. So the model has one categorical variable with two levels, which we need to code as 0 for no and 1 for yes. Hence, we define the variables (Weight, Gest, and Smoking) in the Variable View window; for Smoking, yes takes the value 1 and no takes the value 0. Then we enter the data in the Data View window. To generate the scatter plot, the SPSS procedure is:
Graphs→ Legacy Dialogs→ Scatter/Dot (choose simple Scatter) → Define (move Smoking into the X-Axis box and Weight into the Y-Axis box)→ Ok.
With double clicking on the plot the Chart Editor window pops up then:
Elements → Fit Line at Total (close Properties and Chart Editor windows)
The least squares line indicates a very weak relationship between the two variables; nevertheless, we run the regression with SPSS. The linearity assumption is met: with a single categorical explanatory variable that has two levels, this assumption is always satisfied. So, we can test the two other assumptions, normality and homoscedasticity of the residuals, with the following procedure:
Analyze → Regression → Linear (move the variables into the right spot) → Save (check Unstandardized Residuals) → Continue → Ok.
Then
Graphs → Legacy Dialogs → Scatter/Dot → Simple scatter → Define (move Smoking “x” into the X Axis and the Unstandardized Residuals into the Y Axis) → Ok.
There is no pattern in the residual scatter plot, nor any obvious increase or decrease in variability.
Then, the SPSS regression procedure is:
Analyze → Regression → Linear (move the variables into the right box) → Statistics (you can check Descriptives and estimates and Confidence Intervals) → Continue → Ok.
Some outputs:
Correlations | |||
Wieght | Smoking | ||
Pearson Correlation | Wieght | 1.000 | -.135 |
Smoking | -.135 | 1.000 | |
Sig. (1-tailed) | Wieght | . | .230 |
Smoking | .230 | . | |
N | Wieght | 32 | 32 |
Smoking | 32 | 32 |
This table shows a Pearson correlation of -0.135, which indicates a weak relationship between the two variables. Consistent with this, in the following table the R-square is very low, indicating that only 1.8% of the variability of the response variable is explained by the explanatory variable.
Model Summaryb | ||||||||
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | ||||
1 | .135a | .018 | -.014 | 349.63502 | ||||
a. Predictors: (Constant), Smoking | ||||||||
b. Dependent Variable: Wieght | ||||||||
Coefficientsa | ||||||||
Model | Unstandardized Coefficients | Standardized Coefficients | t | Sig. | 95.0% Confidence Interval for B | |||
B | Std. Error | Beta | Lower Bound | Upper Bound | ||||
1 | (Constant) | 3066.125 | 87.409 | 35.078 | .000 | 2887.613 | 3244.637 | |
Smoking | -92.500 | 123.615 | -.135 | -.748 | .460 | -344.955 | 159.955 | |
a. Dependent Variable: Wieght |
Using the table above, we find the intercept b0 = 3066.125 and slope b1 = -92.5, but as the table shows, the P-value for the slope (b1) is not statistically significant (0.460 > 0.05). We therefore do not reject the null hypothesis (β1 = 0), so there is no statistically significant relationship between the mother's smoking status and the baby's birth weight.
Now we consider two independent variables, Smoking (categorical) and Gest (continuous), with Weight (continuous) as the dependent variable, and run SPSS.
Graphs→ Legacy Dialogs→ Scatter/Dot (choose simple Scatter) → Define (move Gest into the X-Axis box and Weight into the Y-Axis box and Smoking into the Set Markers by) → Ok.
The scatter plot indicates the linear positive relationship between Gest and Weight across both non-smoker and smoker mothers.
Coefficientsa | ||||||||
Model | Unstandardized Coefficients | Standardized Coefficients | t | Sig. | 95.0% Confidence Interval for B | |||
B | Std. Error | Beta | Lower Bound | Upper Bound | ||||
1 | (Constant) | -2389.573 | 349.206 | -6.843 | .000 | -3103.779 | -1675.366 | |
Gest | 143.100 | 9.128 | .963 | 15.677 | .000 | 124.431 | 161.769 | |
Smoking | -244.544 | 41.982 | -.358 | -5.825 | .000 | -330.406 | -158.682 | |
a. Dependent Variable: Wieght |
With the table above we write the equation:
Weight=-2389.573 + 143.1 Gest – 244.544 Smoking
For smoker mothers, we plug 1 for Smoking in the equation and for the non-smoker mother we plug 0.
The P-values for the slopes of both Gest and Smoking indicate that the null hypotheses are rejected, so there is a relationship between Weight and Gest for both non-smoker and smoker mothers, and smoking status itself is also significantly related to Weight.
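To make the interpretation concrete, the fitted equation can be evaluated for a smoker and a non-smoker at the same gestation length; this is plain arithmetic on the coefficients reported above (the 38-week gestation value is an arbitrary example).

```python
# Coefficients from the SPSS output above
b0, b_gest, b_smoking = -2389.573, 143.100, -244.544

def predicted_weight(gest_weeks: float, smoker: int) -> float:
    """Predicted birth weight (grams); smoker is coded 0 = no, 1 = yes."""
    return b0 + b_gest * gest_weeks + b_smoking * smoker

# Example: 38 weeks of gestation
print("non-smoker:", predicted_weight(38, 0))   # intercept plus the Gest term
print("smoker:    ", predicted_weight(38, 1))   # same, minus 244.544 grams
```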
Multiple Regression
The linear regression model with one independent and one dependent variable can be extended as the number of the independent variables increases to two or more. Hence, suppose there is one dependent variable “y” and k independent variables (x1, x2, …., xk), the following is the multiple regression equation with k independent variables:
y=β0+β1x1+β2x2+…+ βkxk
where β0 and βi (for i=1, 2 ,…, k) are the population parameters and can be estimated by the sample (as already explained for the simple linear regression). The point estimates for those parameters are denoted by b0 and bi (for i=1, 2 ,…, k) respectively. Then the equation for the multiple regression model would be:
ŷ=b0+b1x1+b2x2+ …+bkxk
where bi (for i = 1, 2, …, k) are called partial regression coefficients. As in simple linear regression, we interpret each of them; for example, b1 means that for each one-unit increase in the independent variable x1, the predicted value of the response variable increases or decreases on average by |b1| units when the values of the other independent variables are held constant.
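A compact sketch of fitting such a model in Python with statsmodels (the variable names x1, x2, y and the simulated data are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with one response and two predictors
rng = np.random.default_rng(0)
n = 50
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.5 + 2.0 * df["x1"] - 0.8 * df["x2"] + rng.normal(scale=0.5, size=n)

# y = b0 + b1*x1 + b2*x2
fit = smf.ols("y ~ x1 + x2", data=df).fit()

print(fit.params)         # b0, b1, b2 (partial regression coefficients)
print(fit.rsquared_adj)   # adjusted R^2, discussed below
```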
Two independent variables are called collinear if they are correlated. In multiple regression, collinearity makes model estimation difficult; in fact, keeping two or more highly correlated explanatory variables is pointless when the response variable can be explained with fewer of them. Hence, we try to avoid multicollinearity between the independent variables. In short, in multiple regression we look for a model with fewer explanatory variables that are weakly correlated with each other and strongly associated with the response variable. To check for multicollinearity, we can either inspect the correlation table already mentioned (Analyze → Correlate → Bivariate → Ok) or check the collinearity diagnostics in the regression procedure.
Analyze → Regression → Linear (move the variables into the right box) → Statistics (collinearity diagnostic) → Continue → Ok.
Consider the Coefficients table: in the last column, Collinearity Statistics, the VIFs (variance inflation factors) indicate how much the variance of each estimated regression coefficient is inflated because of collinearity. Ideally the value is less than 4; a value greater than 10 is problematic and indicates high multicollinearity.
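The VIFs reported by SPSS can also be computed with statsmodels' variance_inflation_factor; the sketch below uses two hypothetical, deliberately correlated predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x2 is deliberately correlated with x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

# One VIF per design column (the constant column is usually ignored)
for i, name in enumerate(X.columns):
    print(name, round(variance_inflation_factor(X.values, i), 2))
```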
Adjusted R2
In multiple regression, adjusted R² is used instead of R-squared to assess the strength of the model fit, because R-squared increases whenever an explanatory variable is added to the model, regardless of whether that variable is actually associated with the response variable.
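The usual adjustment is R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of explanatory variables. A one-line check in Python, using the R² and sample size that appear in Example 24 below:

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with n observations and k explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: R^2 = 0.077 with n = 53 observations and k = 1 predictor
print(round(adjusted_r_squared(0.077, 53, 1), 3))   # ~0.059, as in Example 24's output
```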
Model selection
Stepwise model selection strategy
There are two ways to execute the stepwise model selection strategy, backward and forward, based on the P-value or the adjusted R².
The backward technique
Consider a model with all possible explanatory variables, called the full model, which is not always the best model. The backward technique eliminates, one at a time, explanatory variables that are not statistically related to the response variable, based on the P-value or the adjusted R², until there is no obvious improvement in the adjusted R².
- Based on the P-value: eliminate the variable with the highest P-value in the full model, refit, and repeat until all remaining variables are statistically significant.
- Based on adjusted R²: eliminate one variable at a time, refit the model, compare the resulting models by their adjusted R², choose the model with the highest adjusted R², and repeat until there is no obvious improvement in the adjusted R².
We may not always reach the same final model using the P-value and adjusted R² approaches. The P-value approach is recommended when we prefer a simpler model; the adjusted R² approach is recommended when we mainly wish to predict the response accurately.
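A minimal sketch of P-value-based backward elimination with statsmodels is shown below; the column names and simulated data are hypothetical, and a real analysis would also monitor the adjusted R² and recheck the assumptions at each step.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df: pd.DataFrame, response: str, alpha: float = 0.05):
    """Drop the predictor with the highest P-value until all remaining are significant."""
    predictors = [c for c in df.columns if c != response]
    while predictors:
        fit = smf.ols(f"{response} ~ {' + '.join(predictors)}", data=df).fit()
        pvals = fit.pvalues.drop("Intercept")      # P-values of the slopes only
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:                  # every slope is significant: stop
            return fit
        predictors.remove(worst)                   # eliminate the weakest predictor
    return None                                    # no predictor survived

# Hypothetical data: y depends on x1 only; x2 and x3 are noise
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(80, 3)), columns=["x1", "x2", "x3"])
df["y"] = 3 + 2 * df["x1"] + rng.normal(scale=0.5, size=80)

final = backward_eliminate(df, "y")
print(final.model.formula if final is not None else "no significant predictors")
```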
The forward technique
In this technique, we perform the steps described above in the opposite direction: we start with one variable and add one variable at a time until reaching the ideal model.
Again, we may not reach the same model using the backward and forward techniques; in that case we can choose the model with the higher adjusted R².
Note: For a categorical variable with multiple levels, we consider all the levels together; if even one level is significant, we keep the whole variable. Therefore, we cannot eliminate some levels of a variable and keep the others.
Assumptions for Multiple Regression
- Linearity for each explanatory variable and response variable (using scatter plot)
- Nearly normal residuals (using normal probability plot or histogram)
- Constant variability (homoscedasticity), using a residual plot
- Independent residuals or observations
Example 24: consider the following variables:
The data (y, x1, x2, x3, x4) are by city.
Y = death rate per 1000 residents
x1 = doctor availability per 100,000 residents
x2 = hospital availability per 100,000 residents
x3 = annual per capita income in thousands of dollars
x4= population density people per square mile
Reference: Life In America’s Small Cities, by G.S. Thomas
Where y is dependent variable and x1, x2, x3, x4 are independent variables.
First, we check the assumptions:
For the linearity assumption, we run SPSS to generate the scatter plots.
Graphs→ Legacy Dialogs→ Scatter/Dot (choose Matrix Scatter) → Define (move all the variables into the Matrix variables box) → Ok.
The plot does not show any specific pattern or curvature for the dependent variable against the independent variables (first row). The linearity assumption is approximately met, although, for example, the relationship between y and x2 seems weak.
We can also generate the residual plots with SPSS to support the linearity assumption.
Analyze→ Regression→ Linear (move dependent and independent variables into their boxes) →
Plots (move “*ZPRED” into the X box and “*ZRESID” into the Y box) and check normal probability plot → Continue → Ok.
The plot above shows no pattern or curvature, which supports the linearity assumption as well as independence, since there is no obvious trend or autocorrelation; however, we can see two outliers below the mean line and one above it. The homoscedasticity assumption is also approximately met. The following outputs show the normal P-P plot and the histogram of the residuals, which approximately support the normality assumption.
We run the following procedure to check the multicollinearity.
Analyze → Regression → Linear (move the variables into the right box) → Statistics (collinearity diagnostic) → Continue → Ok.
Output:
Coefficientsa | ||||||||
Model | Unstandardized Coefficients | Standardized Coefficients | t | Sig. | Collinearity Statistics | |||
B | Std. Error | Beta | Tolerance | VIF | ||||
1 | (Constant) | 12.266 | 2.020 | 6.072 | .000 | |||
Doctor Availability (x1) | .007 | .007 | .168 | 1.066 | .292 | .715 | 1.400 | |
Hospital Availability (x2) | .001 | .001 | .117 | .809 | .423 | .855 | 1.169 | |
Income (x3) | -.330 | .235 | -.214 | -1.408 | .166 | .775 | 1.290 | |
Population Density (x4) | -.009 | .005 | -.269 | -1.936 | .059 | .928 | 1.078 | |
a. Dependent Variable: Death Rate(y) |
The table above shows no multicollinearity since the VIFs are smaller than 4.
Now we can conduct the multiple regression by SPSS:
If we do not change the Method, we fit the full model; in this example, however, we run the stepwise method.
Analyze → Regression → Linear (move the variables into the right boxes) and click Method and select Stepwise → Statistics (check Model fit, R squared change, Descriptives, Part and partial correlations, Collinearity diagnostic, and Casewise diagnostic) → Continue → Ok.
Outputs:
First, we look at the Correlations table, which shows that the independent variables do not exhibit multicollinearity, since the correlations between them are small. However, the correlations between the dependent variable and the independent variables are also smaller than we might hope, which means the predictors do not predict the dependent variable very well; hence, we expect a very low adjusted R².
Correlations | ||||||
Death Rate(y) | Doctor Availability (x1) | Hospital Availability (x2) | Income (x3) | Population Density (x4) | ||
Pearson Correlation | Death Rate(y) | 1.000 | .116 | .111 | -.172 | -.278 |
Doctor Availability (x1) | .116 | 1.000 | .296 | .433 | -.020 | |
Hospital Availability (x2) | .111 | .296 | 1.000 | .028 | .187 | |
Income (x3) | -.172 | .433 | .028 | 1.000 | .129 | |
Population Density (x4) | -.278 | -.020 | .187 | .129 | 1.000 | |
Sig. (1-tailed) | Death Rate(y) | . | .205 | .215 | .109 | .022 |
Doctor Availability (x1) | .205 | . | .016 | .001 | .444 | |
Hospital Availability (x2) | .215 | .016 | . | .423 | .090 | |
Income (x3) | .109 | .001 | .423 | . | .179 | |
Population Density (x4) | .022 | .444 | .090 | .179 | . | |
N | Death Rate(y) | 53 | 53 | 53 | 53 | 53 |
Doctor Availability (x1) | 53 | 53 | 53 | 53 | 53 | |
Hospital Availability (x2) | 53 | 53 | 53 | 53 | 53 | |
Income (x3) | 53 | 53 | 53 | 53 | 53 | |
Population Density (x4) | 53 | 53 | 53 | 53 | 53 |
Model Summaryb | |||||||||
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Change Statistics | ||||
R Square Change | F Change | df1 | df2 | Sig. F Change | |||||
1 | .278a | .077 | .059 | 1.61277 | .077 | 4.259 | 1 | 51 | .044 |
a. Predictors: (Constant), Population Density (x4) | |||||||||
b. Dependent Variable: Death Rate(y) |
The table above shows a very low adjusted R², as expected (5.9%), which means only 5.9% of the variation in the dependent variable is explained by the independent variable; it is therefore not a good model.
ANOVAa | ||||||
Model | Sum of Squares | df | Mean Square | F | Sig. | |
1 | Regression | 11.077 | 1 | 11.077 | 4.259 | .044b |
Residual | 132.652 | 51 | 2.601 | |||
Total | 143.728 | 52 | ||||
a. Dependent Variable: Death Rate(y) | ||||||
b. Predictors: (Constant), Population Density (x4) |
The ANOVA table indicates that the slope of the line is not zero, since the P-value is less than 0.05; the null hypothesis stating that the slope is zero is rejected.
Coefficientsa | ||||||||
Model | Unstandardized Coefficients | Standardized Coefficients | t | Sig. | Collinearity Statistics | |||
B | Std. Error | Beta | Tolerance | VIF | ||||
1 | (Constant) | 10.388 | .569 | 18.245 | .000 | |||
Population Density (x4) | -.010 | .005 | -.278 | -2.064 | .044 | 1.000 | 1.000 | |
a. Dependent Variable: Death Rate(y) |
The table above indicates that the final model has one independent variable, Population Density, which is the most important and most strongly related to the dependent variable, with a P-value of 0.044 (< 0.05); it is significant. The other independent variables are not statistically significant predictors. So the final model is:
Death Rate = 10.388 − 0.010 × Population Density
ANCOVA
ANCOVA, or analysis of covariance, is an analysis method that combines regression and ANOVA. In the simple model there is one continuous variable, the response or dependent variable; one categorical variable, the independent or explanatory variable (specifically, the treatment); and another continuous explanatory variable that has some effect on the response variable, called the covariate, which is not our primary interest and should be controlled for. ANCOVA determines whether the population means of the response variable across the levels of the categorical variable are equal while the effect of the control variable, or covariate, is taken into account (a Python sketch of such a model follows the assumptions list below). The assumptions for ANCOVA are almost the same as for ANOVA.
Assumptions:
- Normality
- Random independent samples
- Homogeneity of variances
- The relationship between the two continuous variables (dependent and covariate) is linear.
- Homogeneity of regression slopes (the regression lines across the levels of the categorical variable should be parallel)
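For readers who want to reproduce an ANCOVA outside SPSS, the following sketch uses statsmodels on simulated data shaped like the example below (the columns Score, Method, and Income are our own naming, and the numbers are invented); note that statsmodels' anova_lm defaults differ from SPSS, which reports Type III sums of squares.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulated reading-score data: a Score, a teaching Method (1-4), and an Income covariate
rng = np.random.default_rng(3)
n = 36
df = pd.DataFrame({
    "Method": rng.integers(1, 5, size=n),
    "Income": rng.normal(50, 15, size=n),
})
df["Score"] = 10 + 0.3 * df["Income"] + 2 * df["Method"] + rng.normal(scale=5, size=n)

# Check homogeneity of regression slopes via the Method:Income interaction
slopes = smf.ols("Score ~ C(Method) * Income", data=df).fit()
print(sm.stats.anova_lm(slopes, typ=2)[["F", "PR(>F)"]])   # Type II sums of squares

# ANCOVA proper: Method effect adjusted for the Income covariate
ancova = smf.ols("Score ~ C(Method) + Income", data=df).fit()
print(sm.stats.anova_lm(ancova, typ=2)[["F", "PR(>F)"]])
```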
Example 25: http://www.real-statistics.com/analysis-of-covariance-ancova/basic-concepts-ancova/
A school system is exploring four methods of teaching reading to their children, and would like to determine which method is best. It selects a random sample of 36 children and randomly divides them into four groups, using a different teaching method for each group. The reading score of each of the children after a month of training is given in following figure.
Before doing the analysis one of the researchers postulated that the scores of the children would be influenced by the income of their families, speculating that children from higher income families would do better on the reading tests no matter which teaching method was used, and so this factor should be taken into account when trying to determine which teaching method to use. The family income (in thousands of dollars) for each of the children in the study is also given in the same figure. Based on the data, is there a significant difference between the teaching methods?
We start with defining the variables in the Variable View window.
- Dependent variable is Score.
- Independent categorical variable or treatment is Method with 4 levels.
- Covariate variable is Income.
Then we check the assumptions. We do not need to check normality since the sample size is greater than 30, and homogeneity of variances is checked when we run the ANCOVA; for homogeneity of regression slopes, we run the following procedure:
Analyze → General Linear Model → Univariate (move dependent, categorical, and covariate variables into the corresponding boxes) → Model (check Custom then move categorical variable (Method) and covariate (Income) and both simultaneously by holding control or shift button and click the arrow button) → Continue
Output:
Tests of Between-Subjects Effects | |||||
Dependent Variable: Reading Score | |||||
Source | Type III Sum of Squares | df | Mean Square | F | Sig. |
Corrected Model | 3286.329a | 7 | 469.476 | 8.137 | .000 |
Intercept | 113.137 | 1 | 113.137 | 1.961 | .172 |
Method | 90.570 | 3 | 30.190 | .523 | .670 |
Income | 1511.200 | 1 | 1511.200 | 26.191 | .000 |
Method * Income | 152.962 | 3 | 50.987 | .884 | .462 |
Error | 1615.560 | 28 | 57.699 | ||
Total | 24038.000 | 36 | |||
Corrected Total | 4901.889 | 35 | |||
a. R Squared = .670 (Adjusted R Squared = .588) |
As the table above shows, the interaction between the categorical variable (Method) and the covariate (Income) is not significant (P-value = 0.462 > 0.05), which means the homogeneity of regression slopes assumption is met. To check linearity between the dependent and covariate variables we generate a scatter plot; the following scatter plot shows a linear relationship between the two variables. If there were no relationship between the covariate and the dependent variable, we would run ANOVA without the covariate.
With the following procedure, we run the ANCOVA:
Analyze → General Linear Model → Univariate → (move dependent, categorical, and covariate variables into the corresponding boxes) → Model → Full factorial → Continue → Options (move Factor (Methods) into the Display Means for box then check Compare main effects (select Bonferroni from Confidence interval adjustment) and check Descriptive Statistics, Homogeneity test, and Estimate of effect size) → Continue → Ok.
Output:
Levene’s Test of Equality of Error Variancesa | |||
Dependent Variable: Reading Score | |||
F | df1 | df2 | Sig. |
1.995 | 3 | 32 | .134 |
Tests the null hypothesis that the error variance of the dependent variable is equal across groups. | |||
a. Design: Intercept + Income + Method |
The table above shows that the homogeneity of variances assumption is met (P-Value=0.134>0.05).
Tests of Between-Subjects Effects | ||||||
Dependent Variable: Reading Score | ||||||
Source | Type III Sum of Squares | df | Mean Square | F | Sig. | Partial Eta Squared |
Corrected Model | 3133.367a | 4 | 783.342 | 13.731 | .000 | .639 |
Intercept | 332.495 | 1 | 332.495 | 5.828 | .022 | .158 |
Income | 1678.353 | 1 | 1678.353 | 29.419 | .000 | .487 |
Method | 571.030 | 3 | 190.343 | 3.336 | .032 | .244 |
Error | 1768.522 | 31 | 57.049 | |||
Total | 24038.000 | 36 | ||||
Corrected Total | 4901.889 | 35 | ||||
a. R Squared = .639 (Adjusted R Squared = .593) |
The P-value for Method is 0.032 < 0.05, which means the null hypothesis that the population means are equal is rejected. Hence, there is at least one method whose mean score differs from the others.
Descriptive Statistics | |||
Dependent Variable: Reading Score | |||
Methods | Mean | Std. Deviation | N |
1 | 22.8750 | 11.58123 | 8 |
2 | 33.7500 | 12.83689 | 8 |
3 | 21.9000 | 10.27889 | 10 |
4 | 15.8000 | 6.69660 | 10 |
Total | 23.0556 | 11.83444 | 36 |
Methods | ||||
Dependent Variable: Reading Score | ||||
Methods | Mean | Std. Error | 95% Confidence Interval | |
Lower Bound | Upper Bound | |||
1 | 23.783a | 2.676 | 18.326 | 29.240 |
2 | 30.054a | 2.756 | 24.433 | 35.675 |
3 | 21.333a | 2.391 | 16.457 | 26.209 |
4 | 18.597a | 2.444 | 13.614 | 23.581 |
a. Covariates appearing in the model are evaluated at the following values: Family Income = 48.8028. |
Comparing the two tables above, the means in the first table are the raw group means and differ from those in the second table, where the covariate is taken into account. The latter are called adjusted means, and ANCOVA compares them, as opposed to ANOVA, which compares the raw group means.
Pairwise Comparisons | ||||||
Dependent Variable: Reading Score | ||||||
(I) Methods | (J) Methods | Mean Difference (I-J) | Std. Error | Sig.b | 95% Confidence Interval for Differenceb | |
Lower Bound | Upper Bound | |||||
1 | 2 | -6.271 | 3.871 | .692 | -17.180 | 4.638 |
3 | 2.450 | 3.593 | 1.000 | -7.677 | 12.576 | |
4 | 5.186 | 3.600 | .958 | -4.959 | 15.331 | |
2 | 1 | 6.271 | 3.871 | .692 | -4.638 | 17.180 |
3 | 8.721 | 3.629 | .135 | -1.507 | 18.948 | |
4 | 11.457* | 3.777 | .029 | .810 | 22.103 | |
3 | 1 | -2.450 | 3.593 | 1.000 | -12.576 | 7.677 |
2 | -8.721 | 3.629 | .135 | -18.948 | 1.507 | |
4 | 2.736 | 3.434 | 1.000 | -6.943 | 12.415 | |
4 | 1 | -5.186 | 3.600 | .958 | -15.331 | 4.959 |
2 | -11.457* | 3.777 | .029 | -22.103 | -.810 | |
3 | -2.736 | 3.434 | 1.000 | -12.415 | 6.943 | |
Based on estimated marginal means | ||||||
*. The mean difference is significant at the .05 level. | ||||||
b. Adjustment for multiple comparisons: Bonferroni. |
The table above shows that the difference between the means of methods 2 and 4 is statistically significant.
Logistic Regression
In logistic regression, we investigate the relationship between a categorical response variable (y) and explanatory variables that can be categorical or numerical. If the predicted (response) variable has two levels, the model is called binary logistic regression, since the variable has two possible outcomes, for example pass or fail, yes or no (dichotomous). Likewise, for a nominal predicted variable with more than two levels we use nominal (multinomial) logistic regression, and for an ordinal predicted variable with more than two levels we use ordinal logistic regression.
Binary Logistic Regression
As mentioned above, in logistic regression we have a dichotomous (binary) categorical dependent variable and either categorical or continuous independent variables; the dependent variable is not normally distributed, as opposed to linear regression.
In this case, logistic regression enables us to find the probability and the odds of the response variable at the success level (yi = 1). Suppose the predicted variable yi for the ith observation takes the value 1 with probability pi (the success probability) and the value 0 with probability 1 − pi, and that we have k predictor variables x1i, x2i, …, xki. The following equation gives the logistic regression model relating the probability pi to the predictor variables:
(1) Logit(pi) = ln(pi / (1 − pi)) = β0 + β1 x1i + β2 x2i + … + βk xki ;  i = 1, 2, 3, …, n (sample size)
where, for example, x1i is the value of the first explanatory variable for the ith observation. Solving the equation above for pi gives:
pi = e^(β0 + β1 x1i + β2 x2i + … + βk xki) / (1 + e^(β0 + β1 x1i + β2 x2i + … + βk xki))
Note that we always have: 0≤ pi≤1
For the sample, we substitute the point estimates for the coefficients, so:
(2) pi = e^(b0 + b1 x1i + b2 x2i + … + bk xki) / (1 + e^(b0 + b1 x1i + b2 x2i + … + bk xki))
With the equation above we can estimate the probability and the odds that the response variable takes the value 1, determine that probability for different values of the explanatory variables, and describe how much the probability of the response variable being at level 1 changes when one explanatory variable increases by one unit while the others are held constant.
Assumptions:
- Dependent variable (y) is a binary variable and for a sample of n, we have n (y1,y2, …, yn) independent observations which are binomially distributed (yi~B(ni,pi)).
- There should be a linear relationship between the logit transformation of the dependent variable and each explanatory variable if all others are held constant as you can see in the equation (1).
- The parameters are estimated by maximum likelihood estimation (MLE) rather than least squares estimation.
- The homogeneity of variances does not need to be met.
Residual
As mentioned earlier, residuals in linear regression are the differences between the observed response values and the predicted response values (ei = yi − ŷi). In logistic regression, the estimated probability for each observation is used instead of the predicted response value, so the equation for the residuals is: ei = yi − pi.
The Odds
In general, “odds” refers to the ratio of the probability of event (A) occurring to the probability of it not occurring.
odds(A) = p / (1 − p), where 0 ≤ p ≤ 1 and P(A) = p
Considering equation (1), we have: Logit(pi) = ln(odds(yi = 1)) = ln(pi / (1 − pi)).
The Odds Ratio
The odds ratio is the ratio of the odds of two events:
odds ratio = odds(A) / odds(B) = k, which means the odds of A occurring are k times the odds of B occurring (when k > 1).
In logistic regression, the odds ratio indicates how the odds change when an independent variable increases by one unit while the other variables are held constant; in the SPSS output, the odds ratio for each independent variable is given by Exp(B).
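A binary logistic regression can likewise be fitted in Python; the sketch below uses statsmodels' logit on invented remission-style data (the names y and LI merely echo the example that follows), and np.exp of the coefficients reproduces the Exp(B) column.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data: a binary outcome y and one continuous predictor LI
rng = np.random.default_rng(4)
n = 60
li = rng.uniform(0, 2, size=n)
p = 1 / (1 + np.exp(-(-3.0 + 2.5 * li)))          # true logistic relationship
df = pd.DataFrame({"LI": li, "y": rng.binomial(1, p)})

fit = smf.logit("y ~ LI", data=df).fit(disp=False)

print(fit.params)             # b0 and b1 on the log-odds (logit) scale
print(np.exp(fit.params))     # Exp(B): odds ratios, as in the SPSS output
```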
Example 26: https://onlinecourses.science.psu.edu/stat501/node/374
Consider data published on n = 27 leukemia patients; the data have a response variable indicating whether leukemia remission occurred (REMISS) (y), where remission is coded as 1.
The predictor variables are cellularity of the marrow clot section (CELL) (x1), smear differential percentage of blasts (SMEAR) (x2), percentage of absolute marrow leukemia cell infiltrate (INFIL) (x3), percentage labeling index of the bone marrow leukemia cells (LI) (x4), absolute number of blasts in the peripheral blood (BLAST) (x5), and the highest temperature prior to start of treatment (TEMP) (x6).
SPSS:
We define the variables described above. In this example we have one dependent variable (REMISS), which is binary and takes the value 0 for no (remission did not occur) and 1 for yes (remission occurred), and the six independent variables mentioned above.
Analyze → Regression → Binary Logistic (move dependent variable into the corresponding box and the independent variables into the covariates box (if there is any categorical independent variable then → categorical (move it to the Categorical Covariates box)) → Save (check for example Probabilities and Group membership) → Continue → Option (check Classification plots, Hosmer-Lemeshow goodness-of-fit, and CI for exp(B)) → Continue → Ok
Output:
Model Summary | |||
Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
1 | 21.594a | .377 | .524 |
a. Estimation terminated at iteration number 9 because parameter estimates changed by less than .001. |
The table above shows a Nagelkerke R-squared of 0.524, which means almost 52% of the variation in the dependent variable is explained by the independent variables.
Hosmer and Lemeshow Test | |||
Step | Chi-square | df | Sig. |
1 | 4.612 | 7 | .707 |
The Hosmer and Lemeshow test, shown in the table above, is not statistically significant, which is what we want, since a non-significant result indicates adequate model fit.
Variables in the Equation | |||||||||
B | S.E. | Wald | df | Sig. | Exp(B) | 95% C.I.for EXP(B) | |||
Lower | Upper | ||||||||
Step 1a | Cell | 30.830 | 52.135 | .350 | 1 | .554 | 24508982903992.880 | .000 | 5.848E+57 |
Smear | 24.686 | 61.526 | .161 | 1 | .688 | 52617581214.632 | .000 | 1.237E+63 | |
INFIL | -24.974 | 65.281 | .146 | 1 | .702 | .000 | .000 | 5.260E+44 | |
LI | 4.360 | 2.658 | 2.691 | 1 | .101 | 78.293 | .428 | 14328.474 | |
Blast | -.012 | 2.266 | .000 | 1 | .996 | .989 | .012 | 83.967 | |
Temp | -100.173 | 77.753 | 1.660 | 1 | .198 | .000 | .000 | 47724943496356340000000.000 | |
Constant | 64.258 | 74.965 | .735 | 1 | .391 | 8071074741646341000000000000.000 | |||
a. Variable(s) entered on step 1: Cell, Smear, INFIL, LI, Blast, Temp. |
As the table above shows, almost none of the independent variables has a significant effect on the dependent variable. The test used in logistic regression for the significance of individual regression coefficients is the Wald test. Based on the Wald test, β1 = β2 = β3 = β5 = β6 = 0, since their P-values are large enough that we do not reject the null hypotheses; we keep only the LI variable, which has the smallest P-value. So, we run SPSS again, this time with only one independent variable, LI:
Output:
Variables in the Equation | |||||||||
B | S.E. | Wald | df | Sig. | Exp(B) | 95% C.I.for EXP(B) | |||
Lower | Upper | ||||||||
Step 1a | LI | 2.897 | 1.187 | 5.959 | 1 | .015 | 18.124 | 1.770 | 185.563 |
Constant | -3.777 | 1.379 | 7.506 | 1 | .006 | .023 | |||
a. Variable(s) entered on step 1: LI. |
The table above indicates that the P-value of 0.015 is statistically significant, so we reject the null hypothesis (β1 = 0) and can write the regression equation:
pi = e^(−3.777 + 2.897 x1i) / (1 + e^(−3.777 + 2.897 x1i))
The odds ratio for the predictor variable (LI) is Exp(B) = Exp(2.897) = 18.124, which means that for a one-unit increase in LI the odds of remission are multiplied by 18.124 (equivalently, the odds ratio for remission not occurring is the reciprocal, 1/18.124). Since LI ranges between 0 and 2, it makes more sense to consider a 0.1-unit increase in LI: the odds of remission are then multiplied by Exp(2.897 × 0.1) = 1.336, meaning a 0.1-unit increase in LI increases the odds of remission by about 33.6%.
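The odds-ratio arithmetic above is easy to verify:

```python
import numpy as np

b1 = 2.897                 # slope for LI from the SPSS output above
print(np.exp(b1))          # ~18.12: odds multiplier for a 1-unit increase in LI
print(np.exp(0.1 * b1))    # ~1.336: multiplier for a 0.1-unit increase (about +33.6%)
```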
Model Summary | |||
Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
1 | 26.073a | .265 | .368 |
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001. |
As the table above indicates, about 37% of the variation in the dependent variable is explained by the independent variable (Nagelkerke R-squared = 0.368).
Classification Tablea | |||||
Observed | Predicted | ||||
Occurring Remission (y) | Percentage Correct | ||||
No | Yes | ||||
Step 1 | Occurring Remission (y) | No | 16 | 2 | 88.9 |
Yes | 5 | 4 | 44.4 | ||
Overall Percentage | 74.1 | ||||
a. The cut value is .500 |
The table above shows that the model correctly classifies 74.1% of the cases.
Omnibus Tests of Model Coefficients | ||||
Chi-square | df | Sig. | ||
Step 1 | Step | 8.299 | 1 | .004 |
Block | 8.299 | 1 | .004 | |
Model | 8.299 | 1 | .004 |
The table above indicates that the logistic regression model is statistically significant and should be a useful predictor, since the P-value of 0.004 is very low.
Glossary:
Data Analysis:
A statistical hypothesis test is a method of statistical inference using data from a scientific study. In statistics, a result is called statistically significant if it has been predicted as unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level.
Chi-squared test determines whether a given form of frequency curve will effectively describe the samples drawn from a given population.
‘Student’s‘ t Test is one of the most commonly used techniques for testing a hypothesis on the basis of a difference between sample means. Explained in layman’s terms, the t test determines a probability that two populations are the same with respect to the variable tested.
By definition, “null hypothesis” assumes that any kind of difference you see in a set of data is due to chance, and that no statistical significance exists in a set of given observations (a simple variable is no different than zero).
Fisher’s Test of Significance is used in determining what outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance; this can help to decide whether results contain enough information to cast doubt on conventional wisdom, given that conventional wisdom has been used to establish the null hypothesis.
Degree of freedom – (Walker 1940) – “the number of observations minus the number of necessary relations among these observations.”
Chi-square – The chi-square distribution is the distribution of a sum of the squares of independent standard normal random variables. The chi-square test is used for goodness of fit of an observed distribution; it measures and summarizes the discrepancy between observed values and the values expected under the model in question.
Randomness – A numeric sequence that has no recognizable pattern or regularities.
Statistical Process – When the process evolves in many or infinite directions.
Deterministic System – When the process evolves in only one direction.
Probability – A measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 (0 indicating impossibility and 1 certainty).
Pierre-Simon Laplace (1774-1778):
- The frequency of an error could be expressed as an exponential function of the numerical magnitude of the error.
- The frequency of an error is an exponential function of the square of the error. This law is reflected in the Gauss curve of distribution (the normal distribution).
Probability Theory – The branch of mathematics concerned with the analysis of random phenomenon. The central objectives of probability theory are:
- Random variables
- Statistical process
If a roll of a die is considered a random event, then when it is repeated many times the sequence of random events will show a pattern that can be studied and predicted.
Variable –The name or alphabetic character that represents a number or a value. It could be known or unknown. Rene Descartes called the unknown variable X,Y, Z and known variable a, b, c. This is still used today.
In statistical terms, probability increases when the values are closer to the expected mean in a normal distribution.
Scatter plots – are used in descriptive statistics to show the observed relationships between different variables.
Descriptive statistics – is the essential step for analysing observational or experimental data and whether the observation is a complete survey or a sample survey.
The next practical step after the descriptive statistics is inferential statistics, which recognizes the pattern and makes inferences. These inferences may include answering the scientific question, assumptions about accepting or rejecting null hypotheses, estimating numerical characteristics of the data, describing associations within the data, correlations within the data, extending the forecasting and predictions through regression analysis and others.