

CURRICULUM IN CARDIOLOGY - STATISTICS

Year: 2018 | Volume: 4 | Issue: 2 | Page: 116-121

Correlation analysis in biological studies
Suniti Yadav
Department of Anthropology, University of Delhi, New Delhi, India
Date of Web Publication: 10-Sep-2018
Correspondence Address: Suniti Yadav, Molecular Anthropology Laboratory, Department of Anthropology, University of Delhi, New Delhi - 110 007, India
Source of Support: None, Conflict of Interest: None
DOI: 10.4103/jpcs.jpcs_31_18
Correlation is a statistical procedure used to test the relationship between quantitative (continuous) or ordered categorical variables. In other words, it describes the degree of relation between two variables. It is one of the most commonly used statistical techniques. The present article is based on selected statistical textbooks, a review of the literature, and our own research experience.
Keywords: Correlation, parametric, risk factors-phenotype relation
How to cite this article: Yadav S. Correlation analysis in biological studies. J Pract Cardiovasc Sci 2018;4:116-21
Introduction   
The concept of correlation was first proposed by Sir Francis Galton in 1894, which was further mathematically described by Karl Pearson in 1896.^{[1]} Correlation analysis is a method of statistical evaluation of the strength of a relationship between two numerically measurable continuous variables.
In biostatistics, univariate statistical tests such as the Chi-square test, Fisher's exact test, t-test, and analysis of variance do not allow taking into account the effect of other covariates/confounders during analyses.^{[2]} However, a technique called partial correlation allows the researcher to control the effect of confounders/covariates in understanding the relation between the two selected variables.^{[3]} Partial correlation looks at the relationship between two variables while removing the effects of other variables.
In statistical terms, correlation is a method of assessing a probable two-way linear association between two measurable continuous variables. The extent of “correlation” is measured by a statistic called the correlation coefficient, which represents the strength of the putative linear association between the two selected variables. In other words, it is a statistic representing how closely two variables covary; it is a dimensionless quantity whose value can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).^{[4]} A positive coefficient of correlation indicates that the variables are directly related, i.e., as the value of one variable increases, the value of the other variable also tends to increase. On the contrary, a negative coefficient indicates that the selected variables are negatively related, i.e., as the value of one variable increases, the value of the other tends to decrease. In statistical terms, any form of relation between two continuous variables that is not linear is not considered as correlation.^{[5]}
In biological research, the relation between the independent (predictor) variables and the outcome (dependent) variable is explored. This explains how the risk factors or predictor variables account for the possibility of the occurrence of a disease or the presence of a phenotype. The disease outcome or dependent variable is associated with biological factors (such as age and gender), lifestyle variables (such as physical activity, smoking, and alcohol consumption), physiological variables (blood pressure and pulse rate), and genetic factors (genetic mutations). To understand such a “risk factors–disease” relationship, two tests may be used, i.e., correlation and regression (Gaddis and Gaddis, 1990). Correlation provides a quantitative way of measuring the degree or strength of the relation between the selected variables, whereas regression describes this relation mathematically by predicting the value of the outcome from the value of the independent predictor.^{[6]}
Types of Correlation   
Pearson's r correlation
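When the data are normally distributed, i.e., “parametric,” Pearson's correlation “r” is used. It is used between variables that are linearly related. Pearson's r correlation is calculated using the following formula:
$$r = \frac{N\Sigma xy - (\Sigma x)(\Sigma y)}{\sqrt{\left[N\Sigma x^{2} - (\Sigma x)^{2}\right]\left[N\Sigma y^{2} - (\Sigma y)^{2}\right]}}$$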
where r = Pearson's r correlation coefficient
N = number of observations
Σxy = sum of the products of paired scores
Σx = sum of x scores
Σy = sum of y scores
Σx^{2} = sum of squared x scores
Σy^{2} = sum of squared y scores.
For the Pearson's r correlation, both variables should be normally distributed (bell-shaped distribution) and linearly related. Linearity means that the relationship between the two variables follows a straight line.
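The raw-score definitions above (N, Σx, Σy, Σxy, Σx^{2}, Σy^{2}) can be turned into a short calculation. The following is a minimal sketch in Python, assuming the standard computational form r = [NΣxy − ΣxΣy] / √([NΣx^{2} − (Σx)^{2}][NΣy^{2} − (Σy)^{2}]); the function name `pearson_r` and the sample values are illustrative and are not data from this article.

```python
import math

def pearson_r(x, y):
    """Pearson's r from the raw-score sums N, Σx, Σy, Σxy, Σx², Σy²."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two observations")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of the products of paired scores
    sum_x2 = sum(a * a for a in x)              # sum of squared x scores
    sum_y2 = sum(b * b for b in y)              # sum of squared y scores
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable has zero variance")
    return numerator / denominator

# Hypothetical paired measurements (illustrative values only)
weight_kg = [58, 62, 65, 70, 74, 80]
waist_cm = [72, 75, 79, 84, 86, 93]
print(round(pearson_r(weight_kg, waist_cm), 3))
```

The sketch only computes the coefficient; it does not check the normality or linearity assumptions, which still need to be examined separately (for example, with a scatterplot).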
Spearman's rank correlation
Spearman's rank correlation is a nonparametric test used to measure the degree of association between two variables. When the data or the distribution of the selected variables is not normally distributed or “skewed,” Spearman's rank correlation may be used. This test of correlation does not carry any assumptions about the distribution of the data and is used best when the variables are measured on a scale that is at least ordinal and the scores on one variable need to be monotonically related to the other variable.
Spearman's rank correlation is calculated using the following formula:
$$\rho = 1 - \frac{6\sum d_{i}^{2}}{n(n^{2} - 1)}$$
where ρ = Spearman's rank correlation
d_{i} = the difference between the ranks of corresponding variables
n = number of observations.
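As an illustration of the rank-difference calculation, the sketch below ranks each variable and applies ρ = 1 − 6Σd_{i}^{2} / [n(n^{2} − 1)]. It assumes there are no tied values, because this simple form is exact only in that case; the function names are hypothetical.

```python
def ranks_no_ties(values):
    """Rank values from 1 (smallest) to n; this sketch assumes no tied values."""
    if len(set(values)) != len(values):
        raise ValueError("tied values present; the simple d-squared formula does not apply")
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_rho(x, y):
    """Spearman's rho from the differences d_i between paired ranks."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and hold at least two observations")
    rank_x, rank_y = ranks_no_ties(x), ranks_no_ties(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# A monotonic but non-linear relation still gives rho = 1
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))  # prints 1.0
```

Because only the ranks enter the calculation, any strictly increasing relation yields ρ = 1, even when the relation is far from a straight line.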
Statistical Simulations to Understand the Relationship between Correlation Coefficient and Scatterplots   
A scatterplot of the selected variables displays their relationship. The higher the correlation between the selected variables, the stronger the linear association between them and the more obvious the trend seen in the scatterplot [Figure 1].
For example, the data depicted in [Figure 2], [Figure 3], [Figure 4], [Figure 5] have been simulated from a bivariate normal distribution of 500 observations with means 2 and 3 for the variables x and y, respectively (Figure source: Mukaka 2012).
Figure 2: Scatterplot of variables x and y; Pearson's correlation = 0.20.
Figure 3: Scatterplot of variables x and y; Pearson's correlation = 0.50.
Figure 4: Scatterplot of variables x and y; Pearson's correlation = 0.80.
Figure 5: Scatterplot of variables x and y; Pearson's correlation = −0.80.
The scatterplot in [Figure 2] shows a linear association between the variables x and y, but the trend is not clear because the coefficient of correlation is low, i.e., 0.20. The trend is more apparent in [Figure 3], where the coefficient of correlation is 0.50. [Figure 4] and [Figure 5] show that the higher the correlation in either direction, i.e., positive or negative, the more visible the linear association in the scatterplot. The strength of the correlation between x and y in [Figure 4] and [Figure 5] is the same, but the direction is opposite: in [Figure 4], y increases as x increases, whereas in [Figure 5], y decreases as x increases.
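A simulation of this kind can be sketched as follows. The code draws 500 paired observations with means 2 and 3 and a chosen population correlation, then reports the sample Pearson's r. Unit standard deviations, the seed, and the function names are assumptions made for illustration; they are not stated in the source of the figures.

```python
import math
import random

def simulate_bivariate_normal(n, mean_x, mean_y, rho, seed=0):
    """Draw n (x, y) pairs from a bivariate normal with unit variances (assumed)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(mean_x + z1)
        # Mixing z1 and an independent z2 gives corr(x, y) = rho in the population
        ys.append(mean_y + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys

def sample_r(x, y):
    """Sample Pearson's r computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

for target in (0.20, 0.50, 0.80, -0.80):
    x, y = simulate_bivariate_normal(500, 2.0, 3.0, target, seed=1)
    print(f"population r = {target:+.2f}, sample r = {sample_r(x, y):+.3f}")
```

The sample coefficient will differ slightly from the population value in each run, which is why a scatterplot of simulated data with r = 0.20 looks like a diffuse cloud while one with r = ±0.80 shows a clear linear band.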
Interpretation of the size of correlation coefficient
The correlation coefficient value may be interpreted from negligible to high positive/negative as shown in [Table 1] (Hinkle et al., 2003).
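A lookup of this kind can be sketched in a few lines. Since [Table 1] is not reproduced in this text, the cut-offs below are an assumption: they follow the rule of thumb commonly attributed to Hinkle et al. (0.90 to 1.00 very high, 0.70 to 0.90 high, 0.50 to 0.70 moderate, 0.30 to 0.50 low, 0.00 to 0.30 negligible). The function name is hypothetical.

```python
def interpret_r(r):
    """Describe the size of a correlation coefficient (assumed rule-of-thumb cut-offs)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)
    if size >= 0.90:
        strength = "very high"
    elif size >= 0.70:
        strength = "high"
    elif size >= 0.50:
        strength = "moderate"
    elif size >= 0.30:
        strength = "low"
    else:
        return "negligible correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} correlation"

print(interpret_r(0.395))  # prints: low positive correlation
print(interpret_r(-0.80))  # prints: high negative correlation
```

Such labels are conventions rather than fixed rules, so the wording should always be reported together with the coefficient itself.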
Coefficient of Correlation (R) and Coefficient of Determination (R^{2})   
The coefficient of correlation (r) is the degree of relationship between two variables, i.e., x and y, whereas the coefficient of determination (R^{2}) gives the proportion of variation in y that is explained by all the x variables together. The value of “r” may vary from −1 to +1, whereas the value of “r^{2}” lies between 0 and +1.
Use of Correlation Analysis in Biological Data   
In biological research, correlation analysis is used to understand the relation between the independent variables (or risk factors) and the dependent variable (or the disease outcome). The selected variables may be continuous or ordinal. For example, to assess the relation between systolic blood pressure (SBP) (continuous dependent variable) and risk factors/independent variables such as age (continuous) and weight (continuous), Pearson's correlation analysis would be used. On the other hand, to understand the relation between maternal age (continuous) and parity (ordinal) or between the number of hospitalizations (ordinal) and history of stroke (ordinal), Spearman's correlation analysis would be used.
How to Perform Correlation in SPSS?   
Correlation analysis can be performed in the SPSS statistical software (IBM SPSS Statistics for Windows, IBM Corp., Released 2011, Version 20.0, Armonk, NY, USA). The procedure is as follows [Table 1], [Table 2], [Table 3], [Table 4].
Table 2: Bivariate (Pearson) correlation analysis between systolic blood pressure and waist circumference
Table 3: Bivariate (Spearman) correlation analysis between systolic blood pressure and waist circumference
Table 4: Partial correlation between systolic blood pressure and waist circumference controlling for smoking and education
Click Analyze > Correlate > Bivariate > select variables > select correlation coefficient > select test of significance (keep two tailed) > flag significant correlations (box checked) > OK (enter).
Example 1: Data (n = 967) on the waist circumference (WC) and SBP were collected and bivariate correlation would be tested to understand the relation between the two.
Since both selected variables are continuous, bivariate correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumption for both variables. The Pearson's correlation coefficient, r = 0.395, P < 0.001 [Table 2], indicates a low positive, yet statistically significant, linear relation between WC and SBP. The coefficient of determination, R^{2}, is 0.156 (0.395^{2}), which implies that WC accounts for only 15.6% of the variation in SBP.
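The step from r to R^{2} in Example 1 is a simple squaring, which the following sketch reproduces. The function name is hypothetical; the value 0.395 is the coefficient reported for Example 1.

```python
def coefficient_of_determination(r):
    """Square a correlation coefficient to get the proportion of variation explained."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    return r ** 2

r = 0.395  # Pearson's r between WC and SBP reported in Example 1
r2 = coefficient_of_determination(r)
print(f"R^2 = {r2:.3f}, variation explained = {100 * r2:.1f}%")
# prints: R^2 = 0.156, variation explained = 15.6%
```

Because the coefficient is squared, a correlation of −0.395 would give the same R^{2}; the direction of the relation is carried by r alone.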
Example 2: Data (n = 936) on the WC and the body mass index (BMI) status were collected. BMI status was categorized into underweight, normal, overweight, and obese. Bivariate correlation would be tested to understand the relation between the two.
Since one of the selected variables is continuous (WC) while the other is ordinal (BMI status), bivariate correlation analysis is performed using Spearman's correlation coefficient, which does not require the variables to be normally distributed. The Spearman's correlation coefficient, ρ = 0.398, P < 0.001 [Table 3], indicates a low positive, yet statistically significant, monotonic relation between WC and BMI status. The squared coefficient, 0.158 (0.398^{2}), implies that BMI status explains 15.8% of the variation in WC when both variables are expressed as ranks.
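An ordinal variable such as BMI status has only four categories, so many observations share the same rank. One common way to handle such ties, sketched below, is to give tied values the average of the ranks they occupy and then compute Pearson's r on the ranks. The category coding, the function names, and the sample values are assumptions for illustration and are not the data of Example 2.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_with_ties(x, y):
    """Spearman's rho computed as Pearson's r on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ordinal coding and sample (illustrative values only)
BMI_ORDER = {"underweight": 0, "normal": 1, "overweight": 2, "obese": 3}
waist_cm = [68, 74, 80, 83, 91, 95, 102, 88]
bmi_status = ["underweight", "normal", "normal", "overweight",
              "overweight", "obese", "obese", "normal"]
bmi_codes = [BMI_ORDER[status] for status in bmi_status]
print(round(spearman_with_ties(waist_cm, bmi_codes), 3))
```

Only the order of the category codes matters here; any increasing coding of underweight, normal, overweight, and obese gives the same ranks and therefore the same coefficient.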
Correlation analysis can also be used for calculating independent correlation between variables adjusting for the effect of other variables. Such analysis can be done using partial correlation analysis in SPSS. The following command is given:
Click Analyze > Correlate > Partial > select variables > select controlling for (variables) > select test of significance (keep two tailed) > flag significant correlations (box checked) > OK (enter).
Example 3: Data (n = 940) on the WC and the SBP were collected and partial correlation would be tested to understand the relation between the two controlling for confounding factors such as smoking status and education.
Since both selected variables are continuous, partial correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumption for both variables. The partial correlation coefficient, r = 0.381, P < 0.001 [Table 4], indicates a low positive, yet statistically significant, linear relation between WC and SBP after controlling for the effect of the confounders, i.e., smoking and education.
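The idea of removing a confounder's effect can be illustrated with the first-order partial correlation, which adjusts the correlation between x and y for a single third variable z using the three pairwise coefficients. This is a simplified sketch: Example 3 controls for two variables, which SPSS handles internally, whereas the function below (a hypothetical name) covers only one covariate, and the input values are invented for illustration.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for one covariate z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator

# Hypothetical pairwise correlations (not taken from the article's tables)
print(round(partial_r(r_xy=0.50, r_xz=0.50, r_yz=0.50), 3))  # prints 0.333
```

When the covariate is unrelated to both variables (r_xz = r_yz = 0), the partial correlation equals the ordinary correlation; the more the covariate is shared by x and y, the more the adjusted value moves away from it.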
How to Perform Correlation Online   
Simplified calculations for correlation analysis can also be performed online using the link: http://www.socscistatistics.com/tests/pearson/default2.aspx
Example 4: Consider the continuous data on weight (n = 15) and WC (n = 15). Calculate the correlation between the selected variables.
Enter the data for variable X, i.e., weight, in the designated column and WC in column Y [Figure 6]a.
Figure 6: (a) Data entered for variable X (weight in kg). (b) Correlation graph between the two variables X (weight in kg) and Y (waist circumference in cm). (c) Calculation of R.
Click on the tab “Calculate R” and the correlation graph would be obtained [Figure 6]b. The value of R would be calculated using standard formulae [Figure 6]c.
Note the value of R for the calculation of P value. Further, calculate the P value using the link: http://www.socscistatistics.com/pvalues/pearsondistribution.aspx for the value of R, i.e., 0.9876 and n = 15. The P value thus obtained is <0.00001.
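The P value for a given R and n can also be approximated without an online calculator. The sketch below assumes bivariate normal data and the null hypothesis of zero correlation, under which r has density proportional to (1 − r^{2})^{(n − 4)/2} on [−1, +1]; this is equivalent to the usual t-test with n − 2 degrees of freedom. The function name and the use of Simpson's rule for the integration are illustrative choices, and this is not necessarily how the online tool computes its result.

```python
import math

def pearson_p_value(r, n, steps=2000):
    """Two-tailed P value for H0: no correlation, assuming bivariate normal data."""
    if n < 4:
        raise ValueError("this sketch needs at least four observations")
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    exponent = (n - 4) / 2
    # Normalising constant Beta(1/2, (n - 2)/2), written with gamma functions
    beta = math.gamma(0.5) * math.gamma((n - 2) / 2) / math.gamma((n - 1) / 2)

    def density(u):
        # max() guards against tiny negative values from rounding near u = 1
        return max(0.0, 1.0 - u * u) ** exponent / beta

    # Simpson's rule for the upper-tail area from |r| to 1 (steps is even)
    lower, upper = abs(r), 1.0
    h = (upper - lower) / steps
    total = density(lower) + density(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * density(lower + k * h)
    tail = total * h / 3
    return min(1.0, 2 * tail)

p = pearson_p_value(0.9876, 15)  # R and n from Example 4
print(p < 0.00001)  # prints True
```

For R = 0.9876 and n = 15, the computed value falls far below 0.00001, in agreement with the result quoted above.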
This can also be done in Excel as shown below:
The data set relates weight and blood sugar. [Figure 7] shows the data, [Figure 8] is a scatter diagram which shows a strong positive correlation, and [Figure 9] shows the correlation calculations.
Figure 8: Creating a scatter plot in Excel. Choose: INSERT > SCATTER > select the data set.
Figure 9: Calculating the correlation coefficient. Enter the formula =PEARSON(array1, array2) in any cell.
Conclusion   
Correlation is the technique for testing the strength of the linear relationship between two variables. It can be used for continuous or ordinal variables and can also assess the independent relation between the variables while controlling for the effect of confounders or other variables.
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References   
1.  Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 1896;187:253-318.
2.  Chan YH. Biostatistics 201: Linear regression analysis. Singapore Med J 2004;45:55-61.
3.  Chan YH. Biostatistics 103: Qualitative data - tests of independence. Singapore Med J 2003;44:498-503.
4.  Swinscow TDV, Campbell MJ. Statistics at square one. London: BMJ. 2002; p. 111-25.
5.  Mukaka MM. A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal 2012;24:69-71.
6.  Hinkle DE, Wiersma W, Jurs SG. Applied Statistics for the Behavioral Sciences. 5^{th} ed. Boston: Houghton Mifflin; 2003.
