

STATISTICAL PAGES

Year : 2015  Volume : 1  Issue : 1  Page : 69-71

Chi-square test and its application in hypothesis testing
Rakesh Rana, Richa Singhal
Statistical Section, Central Council for Research in Ayurvedic Sciences, Ministry of AYUSH, GOI, New Delhi, India
Date of Web Publication: 22-May-2015
Correspondence Address: Dr. Richa Singhal Central Council for Research in Ayurvedic Sciences, Ministry of AYUSH, GOI, New Delhi India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/2395-5414.157577
In medical research, studies often collect data on categorical variables that can be summarized as a series of counts. These counts are commonly arranged in a tabular format known as a contingency table. The Chi-square test statistic can be used to evaluate whether there is an association between the rows and columns of a contingency table. More specifically, it can be used to determine whether the study groups differ in the proportions of the risk factor of interest. The Chi-square test and the logic of hypothesis testing were developed by Karl Pearson. This article describes what a Chi-square test is, the type of data on which it is used, the assumptions associated with its application, how to calculate it manually, and how to use an online calculator to obtain the Chi-square statistic and its associated P value.
Keywords: Categorical data analysis, Chi-square test, hypothesis testing, online calculator
How to cite this article: Rana R, Singhal R. Chi-square test and its application in hypothesis testing. J Pract Cardiovasc Sci 2015;1:69-71
The logic of hypothesis testing was first invented by Karl Pearson (1857-1936), a renaissance scientist, in Victorian London in 1900. ^{[1]} Pearson's Chi-square distribution and the Chi-square test, also known as the test for goodness-of-fit and the test of independence, are his most important contributions to the modern theory of statistics. The importance of Pearson's Chi-square distribution was that statisticians could use statistical methods that did not depend on the normal distribution to interpret their findings. He invented the Chi-square distribution mainly to cater to the needs of biologists, economists, and psychologists. His 1900 paper, published in the Philosophical Magazine, sets out the Chi-square distribution and the goodness-of-fit test. ^{[2],[3]}
The Chi-square test is a nonparametric test used for two specific purposes: (a) to test the hypothesis of no association between two or more groups, populations, or criteria (i.e., to check independence between two variables); and (b) to test how well the observed distribution of data fits the distribution that is expected (i.e., to test the goodness-of-fit). It is used to analyze categorical data (e.g., male or female patients, smokers and non-smokers, etc.); it is not meant to analyze parametric or continuous data (e.g., height measured in centimeters or weight measured in kg, etc.).
For example, suppose that in a health camp attended by 50 persons we want to test whether those who exercise regularly have a lower body mass index (BMI), using their actual BMI values; this cannot be tested using a Chi-square test. However, if we divide the same 50 persons into two categories, obese (BMI ≥ 30) and non-obese (BMI < 30), then the same data can be tested using a Chi-square test by counting the number of obese and non-obese persons in two groups: those who exercise regularly and those who do not. A 2 × 2 contingency table, also known as a cross table, can be constructed for calculating the Chi-square statistic [Table 1].
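The conversion from raw BMI values to a 2 × 2 table of counts can be sketched in Python as follows. This is only an illustration: the function name `build_contingency_table` and the six sample records are assumptions made here, not data from the health camp example.

```python
from collections import Counter

OBESITY_CUTOFF = 30.0  # BMI >= 30 is classed as obese, as in the text


def build_contingency_table(records):
    """Count persons in each (exercise group, BMI category) cell.

    `records` is an iterable of (exercises_regularly, bmi) pairs.
    Returns a 2 x 2 list: rows = exercises regularly yes/no,
    columns = obese/non-obese.
    """
    counts = Counter()
    for exercises_regularly, bmi in records:
        category = "obese" if bmi >= OBESITY_CUTOFF else "non-obese"
        counts[(bool(exercises_regularly), category)] += 1
    return [
        [counts[(True, "obese")], counts[(True, "non-obese")]],
        [counts[(False, "obese")], counts[(False, "non-obese")]],
    ]


# Made-up records for illustration only (hypothetical, not the article's data).
sample = [
    (True, 22.5), (True, 31.0), (False, 33.2),
    (False, 27.9), (True, 24.1), (False, 30.0),
]
table = build_contingency_table(sample)
print(table)
```

Once the continuous BMI values have been reduced to counts in this way, the table can be passed to a Chi-square calculation.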
Assumptions Underlying a Chi-square Test
 The data are randomly drawn from a population
 The values in the cells are considered adequate when no expected count is <5 and no cell has a zero count ^{[4],[5]}
 The sample size is sufficiently large. Applying the Chi-square test to a small sample could lead to a type II error (i.e., failing to reject the null hypothesis when it is actually false). There is no agreed cutoff for the sample size; however, the suggested minimum varies from 20 to 50
 The categories of the variables under consideration must be mutually exclusive. That is, each subject must be counted only once, in one particular category, and must not appear in any other category. In other words, no item shall be counted twice.
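The cell-count assumption above can be checked mechanically before running the test. The following is a minimal sketch, assuming the observed and expected counts are already available as nested lists; the function name `check_cell_assumptions` and the example counts are hypothetical.

```python
def check_cell_assumptions(observed, expected, min_expected=5):
    """Return a list of messages for cells that violate the assumptions.

    `observed` and `expected` are tables (lists of rows) of the same shape,
    excluding any totals row or column.
    """
    problems = []
    for i, (obs_row, exp_row) in enumerate(zip(observed, expected)):
        for j, (o, e) in enumerate(zip(obs_row, exp_row)):
            if o == 0:
                problems.append(f"cell ({i}, {j}) has a zero observed count")
            if e < min_expected:
                problems.append(
                    f"cell ({i}, {j}) has expected count {e:.2f} < {min_expected}"
                )
    return problems


# Hypothetical counts for illustration only.
observed = [[1, 0], [9, 10]]
expected = [[0.5, 0.5], [9.5, 9.5]]
for message in check_cell_assumptions(observed, expected):
    print(message)
```

An empty list means that neither the zero-count nor the minimum-expected-count condition is violated; it says nothing about random sampling or mutual exclusivity, which depend on the study design.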
How to Calculate a Chi-square Statistic?
The formula for calculating a Chi-square statistic is:
χ^{2} = Σ (O − E)^{2} / E
where,
O stands for the observed frequency,
E stands for the expected frequency.
The expected count is subtracted from the observed count to find the difference between the two. The difference is then squared to get rid of negative values (as the squares of 2 and −2 are, of course, both 4). The squared difference is then divided by the expected count to normalize bigger and smaller values (because we do not want to get bigger Chi-square values just because we are working with large data sets). The sigma sign (Σ) denotes that we have to sum up these values calculated for each cell.
As an example, suppose we want to find out whether there is an association between smoking and lung disease.
The null and alternative hypotheses will be:
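The calculation just described (subtract, square, divide by the expected count, and sum over the cells) can be sketched in a few lines of Python. The function name and the counts below are assumptions chosen for illustration and are not taken from the article's tables.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all cells.

    `observed` and `expected` are flat sequences with one entry per cell.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical cell counts for illustration only.
observed = [10, 20, 30, 40]
expected = [15, 15, 35, 35]
print(round(chi_square_statistic(observed, expected), 4))
```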
H _{0} : There is no association between smoking and lung disease.
H _{1} : There is an association between smoking and lung disease.
The basic step for calculating a Chi-square test is setting up a 2 × 2 contingency table [Table 2].
Table 2: General notation for a 2 × 2 contingency table (observed values for the data)
The general formula for calculating the expected count of a particular cell from the observed counts is [(corresponding row total × corresponding column total) / total number of patients] [Table 3].
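As a sketch of this rule, the expected count of every cell can be derived from the row totals, the column totals, and the grand total. The function name `expected_counts` and the observed counts are hypothetical and serve only to show the arithmetic.

```python
def expected_counts(observed):
    """Return expected counts: (row total * column total) / grand total.

    `observed` is a table (list of rows) without the totals row or column.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [
        [row_total * col_total / grand_total for col_total in col_totals]
        for row_total in row_totals
    ]


# Hypothetical observed counts for illustration only.
observed = [[12, 8], [6, 14]]
print(expected_counts(observed))
```

For this made-up table both row totals are 20, the column totals are 18 and 22, and the grand total is 40, so each expected count is 20 × 18 / 40 or 20 × 22 / 40.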
Before we proceed further, we need to know how many degrees of freedom (df) we have. When a comparison is made between one sample and another, a simple rule is that the df equals (number of columns − 1) × (number of rows − 1), excluding the row and column containing the totals. Hence, in our example, df = (2−1) × (2−1) = 1.
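The df rule can be written as a one-line helper. This is a small sketch with a hypothetical function name; it assumes the table passed in already excludes the totals row and column.

```python
def degrees_of_freedom(table):
    """Return (rows - 1) * (columns - 1) for a table without totals."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


print(degrees_of_freedom([[0, 0], [0, 0]]))  # a 2 x 2 table
```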
Hypothetical data for calculating the Chi-square test for our example of testing an association between smoking and lung disease are given in [Table 4]. The Chi-square statistic can be calculated manually using the formula described above; refer to [Table 5] and [Table 6] for the manual calculations. The Chi-square value for our example, as shown in [Table 6], is 3.42, with df = 1. If we want to test our hypothesis at the 5% level of significance, then our predetermined alpha level is 0.05. Looking into the Chi-square distribution table [Table 7] with 1 degree of freedom and reading along the row, we find that our value of χ^{2} (3.42) lies between 2.706 and 3.841. The corresponding probability is between the 0.10 and 0.05 probability levels. That means that the P value is above 0.05 (it is actually 0.065). Since a P value of 0.065 is greater than the conventionally accepted significance level of 0.05 (i.e., P > 0.05), we fail to reject the null hypothesis and conclude that the data do not show a significant association between smoking and lung disease.
Table 4: Hypothetical data containing observed values for calculating the Chi-square statistic
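The exact P value quoted above can be reproduced without a table. For df = 1, the Chi-square distribution is that of a squared standard normal variable, so the upper-tail probability equals erfc(√(χ^{2}/2)). The sketch below relies on that identity and is therefore valid only for df = 1; the function name is an assumption made here.

```python
import math


def p_value_df1(chi_square):
    """Upper-tail P value of a Chi-square statistic with 1 degree of freedom.

    Uses P(X > x) = erfc(sqrt(x / 2)), which holds only when df = 1.
    """
    if chi_square < 0:
        raise ValueError("a Chi-square statistic cannot be negative")
    return math.erfc(math.sqrt(chi_square / 2.0))


# The statistic reported for the smoking and lung disease example.
print(round(p_value_df1(3.4177), 4))
```

For tables with more than one degree of freedom a statistical library or a distribution table is needed instead.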
How to Use a Chi-square Distribution Table to Approximate the P Value?
Scientists and statisticians use large tables of values to approximate the P value for their experiment. These tables are generally set up with the vertical axis on the left corresponding to the df and the horizontal axis on the top corresponding to the P value. To use such a table, first find the row for our df, then read that row from left to right until we find the first value bigger than our Chi-square value, and look at the corresponding P value at the top of that column. The P value of our statistic lies between this value and the one at the top of the preceding column. Chi-square distribution tables are available from a variety of sources; they can easily be found online or in science and statistics textbooks.
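The table lookup can be sketched as follows. The critical values are standard upper-tail values for df = 1 and df = 2 at four common probability levels; this is a small excerpt assumed for illustration, and the layout of [Table 7] itself may differ. The function name `bracket_p_value` is hypothetical.

```python
# Upper-tail probability levels (column headings of the table excerpt).
P_LEVELS = [0.10, 0.05, 0.01, 0.001]

# Standard critical values of the Chi-square distribution, one row per df.
CRITICAL_VALUES = {
    1: [2.706, 3.841, 6.635, 10.828],
    2: [4.605, 5.991, 9.210, 13.816],
}


def bracket_p_value(chi_square, df):
    """Return (lower, upper) bounds on the P value from the table excerpt.

    Reads the df row from left to right and stops at the first critical
    value bigger than the statistic.
    """
    row = CRITICAL_VALUES[df]
    for index, critical in enumerate(row):
        if critical > chi_square:
            upper = 1.0 if index == 0 else P_LEVELS[index - 1]
            return (P_LEVELS[index], upper)
    # The statistic exceeds every tabulated value in the row.
    return (0.0, P_LEVELS[-1])


print(bracket_p_value(3.42, 1))
```

For the example statistic of 3.42 with df = 1, the first bigger value in the row is 3.841, so the P value lies between 0.05 and 0.10.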
Using an Online Chi-square Calculator
The Chi-square statistic and its associated P value can also be calculated using online calculators, which are easily available on the internet. For a user-friendly online calculator, you may visit www.socscistatistics.com/tests/chisquare/default2.aspx. Many more online calculators are available on the World Wide Web. The basic step in using an online calculator is to fill in your data correctly.
The step-by-step procedure for using an online calculator is described below:
 Step 1: For our example of finding an association between smoking and lung disease, fill in the observed values in the cells of the online calculator [Figure 1]
 Figure 1: Setting up the data in the 2 × 2 table of an online calculator.
 Step 2: Click on the Next button. Another screen will appear, as shown in [Figure 2]
 Step 3: Click on the Calculate Chi^2 button. The calculation is now complete, and the output of the Chi-square test will be as shown in [Figure 3].
The output shows the Chi-square value as 3.4177 and its associated P value as 0.0645, which is greater than the significance level of 0.05; hence, no significant association has been observed. To conclude, the data do not show a significant association between smoking and lung disease.
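For readers who prefer a script to a web form, the same kind of calculation can be sketched in plain Python. This is a minimal illustration for a 2 × 2 table without a continuity correction; the function name `chi_square_2x2` and the counts are assumptions made here and are not the values of [Table 4], so the result differs from the article's example.

```python
import math


def chi_square_2x2(observed):
    """Pearson Chi-square test for a 2 x 2 table of observed counts.

    Returns (chi_square, df, p_value). No continuity correction is applied,
    and the P value formula used is valid only for df = 1.
    """
    (a, b), (c, d) = observed
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count
            chi_square += (o - e) ** 2 / e

    df = 1  # (2 - 1) * (2 - 1)
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, df, p_value


# Hypothetical counts for illustration only (rows: exposed / not exposed,
# columns: disease / no disease).
chi_square, df, p_value = chi_square_2x2([[30, 20], [20, 30]])
print(round(chi_square, 4), df, round(p_value, 4))
```

The P value returned is then compared with the chosen significance level in the same way as the calculator output.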
What Does a Chi-square Test Tell Us and What Does It Not?
It should be clearly understood that the Chi-square test only tests whether two variables are associated with each other or not. It does not tell us how closely they are associated. For instance, in the above example, the Chi-square test only tells us whether there is any relation between smoking and lung disease. It does not tell us how likely it is that smokers are prone to lung disease. However, once we know that there is a relation between the two variables, we can explore other methods to calculate the strength of the association between them.
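The article does not name a specific measure of strength. One commonly used option for a 2 × 2 table, shown here purely as an illustration, is the phi coefficient, defined as the square root of χ^{2} divided by the total sample size. The function name and the input values below are hypothetical.

```python
import math


def phi_coefficient(chi_square, sample_size):
    """Phi coefficient for a 2 x 2 table: sqrt(chi_square / n).

    Ranges from 0 (no association) to 1 (perfect association).
    """
    if sample_size <= 0:
        raise ValueError("sample size must be positive")
    return math.sqrt(chi_square / sample_size)


# Hypothetical values for illustration only.
print(round(phi_coefficient(4.0, 100), 3))
```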
References   
1.  Magnello ME. Karl Pearson and the origin of modern statistics: An elastician becomes a statistician. Rutherford J, Vol. 1, 2005-2006. Available online at: http://rutherfordjournal.org.
2.  Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philos Mag Ser 1900;50:157-75.
3.  Plackett RL. Karl Pearson and the Chi-squared test. Int Stat Rev 1983;51:59-72.
4.  Yates F, Moore D, McCabe G. The Practice of Statistics. 1^{st} ed. New York: W.H. Freeman; 1999.
5.  Yates F. Contingency table involving small numbers and the Chi-squared test. Suppl J R Stat Soc 1934;1:217-35.



