CURRICULUM IN CARDIOLOGY - STATISTICS
Year: 2018 | Volume: 4 | Issue: 2 | Page: 116-121

Correlation analysis in biological studies

Suniti Yadav
Department of Anthropology, University of Delhi, New Delhi, India

Correlation is a statistical procedure for testing the relationship between two variables, which may be quantitative or ordinal. In other words, it describes the degree of relation between two variables. It is one of the most commonly used statistical techniques. The present article is based on selected statistical textbooks, a review of the literature, and our own research experience.
Introduction

The concept of correlation was first proposed by Sir Francis Galton in 1894 and was further described mathematically by Karl Pearson in 1896.[1] Correlation analysis is a method of statistical evaluation of the strength of a relationship between two numerically measurable continuous variables. In biostatistics, univariate statistical tests such as the chi-square test, Fisher's exact test, the t-test, and analysis of variance do not allow the effect of other covariates/confounders to be taken into account during analysis.[2] However, a technique called partial correlation allows the researcher to control for the effect of confounders/covariates when examining the relation between the two selected variables.[3] Partial correlation looks at the relationship between two variables while removing the effects of other variables.

In statistical terms, correlation is a method of assessing a possible two-way linear association between two measurable continuous variables. The extent of “correlation” is measured by a statistic called the correlation coefficient, which represents the strength of the putative linear association between the two selected variables. In other words, it is a statistic representing how closely two variables covary; it is a dimensionless quantity whose value can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).[4] A positive correlation coefficient indicates that the variables are directly related, i.e., as the value of one variable increases, the value of the other variable also tends to increase. A negative coefficient, on the contrary, indicates that the selected variables are inversely related, i.e., as the value of one variable increases, the value of the other tends to decrease.
In statistical terms, any relation between two continuous variables that is not linear is not considered correlation.[5] In biological research, the relation between the independent (predictor) variables and the outcome (dependent) variable is explored. This explains how the risk factors or predictor variables account for the occurrence of a disease or the presence of a phenotype. The disease outcome or dependent variable is associated with biological factors (such as age and gender), lifestyle variables (such as physical activity, smoking, and alcohol consumption), physiological variables (such as blood pressure and pulse rate), and genetic factors (such as genetic mutations). To understand such “risk factors–disease” relationships, two tests may be used, i.e., correlation and regression (Gaddis and Gaddis, 1990). Correlation provides a quantitative measure of the degree or strength of the relation between the selected variables, whereas regression describes this relation mathematically by predicting the value of the outcome from the value of the independent predictor.[6]

Types of Correlation

Pearson's r correlation

When the data are normally distributed, or “parametric,” Pearson's correlation “r” is used. It is used for variables that are linearly related. Pearson's r correlation is calculated using the following formula:

$$r = \frac{N\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[N\sum x^{2} - \left(\sum x\right)^{2}\right]\left[N\sum y^{2} - \left(\sum y\right)^{2}\right]}}$$

where
r = Pearson's r correlation coefficient
N = number of observations
Σxy = sum of the products of paired scores
Σx = sum of x scores
Σy = sum of y scores
Σx² = sum of squared x scores
Σy² = sum of squared y scores.

For Pearson's r correlation, both variables should be normally distributed (a bell-shaped distribution) and linearly related. Linearity assumes a straight-line relationship between the two variables.
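The raw-score sums defined above (N, Σxy, Σx, Σy, Σx², Σy²) combine into the usual computational form of Pearson's r, r = [NΣxy − ΣxΣy] / √([NΣx² − (Σx)²][NΣy² − (Σy)²]). The following is a minimal sketch of that calculation in Python, using only the standard library. The function name `pearson_r` and the weight and blood pressure values are hypothetical and chosen purely for illustration; they are not data from this article.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed from the raw-score sums (N, Σxy, Σx, Σy, Σx², Σy²)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two observations")
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # Σxy
    sum_x2 = sum(a * a for a in x)              # Σx²
    sum_y2 = sum(b * b for b in y)              # Σy²
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("r is undefined when either variable has zero variance")
    return numerator / denominator


# Hypothetical illustrative values (assumption, not data from the article)
weight_kg = [55, 60, 65, 70, 75, 80]
sbp_mmhg = [112, 118, 117, 125, 130, 128]

r = pearson_r(weight_kg, sbp_mmhg)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

Squaring the returned value gives the proportion of variation shared by the two variables, which is how a coefficient of determination is obtained from r in the two-variable case.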
Spearman's rank correlation

Spearman's rank correlation is a nonparametric test used to measure the degree of association between two variables. When the distribution of the selected variables is not normal, i.e., “skewed,” Spearman's rank correlation may be used. This test does not carry any assumptions about the distribution of the data. It is best used when the variables are measured on a scale that is at least ordinal and the scores on one variable are monotonically related to the scores on the other. Spearman's rank correlation is calculated using the following formula:

$$\rho = 1 - \frac{6\sum d_i^{2}}{n\left(n^{2} - 1\right)}$$

where
ρ = Spearman's rank correlation
d_i = the difference between the ranks of corresponding variables
n = number of observations.

Statistical Simulations to Understand the Relationship between Correlation Coefficient and Scatterplots

A scatterplot of the selected variables can present their relationship. The higher the correlation between the selected variables, the stronger the linear association between them, and hence the more obvious the trend seen in the scatterplot [Figure 1].{Figure 1}

For example, the data depicted in [Figure 2], [Figure 3], [Figure 4], [Figure 5] have been simulated from a bivariate normal distribution of 500 observations with means 2 and 3 for the variables x and y, respectively (Figure source: Mukaka 2012).{Figure 2}{Figure 3}{Figure 4}{Figure 5}

The scatterplot in [Figure 2] shows a linear association between the variables x and y, but the trend is not clear because the correlation coefficient is low, i.e., 0.20. The trend is clearer in [Figure 3], where the correlation coefficient is 0.50. [Figure 4] and [Figure 5] show that the higher the correlation in either direction, i.e., positive or negative, the more visible the linear association in the scatterplot.
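A simulation of the kind described above can be sketched in a few lines of standard-library Python. The sketch draws 500 paired observations with means 2 and 3 and a chosen population correlation, then computes the sample Pearson's r and Spearman's ρ, the latter through the rank-difference formula ρ = 1 − 6Σd_i² / (n(n² − 1)), which holds when there are no tied ranks. The function names, the unit variances, the seed, and the target values ±0.80 are assumptions made for illustration; only 0.20 and 0.50 are stated in the text.

```python
import math
import random


def simulate_bivariate_normal(n, mean_x, mean_y, rho, seed=0):
    """Draw n (x, y) pairs with unit variances and population correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(mean_x + z1)
        # Mixing two independent standard normals this way gives corr(x, y) = rho
        ys.append(mean_y + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys


def pearson_r(x, y):
    """Sample Pearson's r from mean-centred cross products."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n in ascending order; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho via the rank-difference formula (valid without ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))


# 0.20 and 0.50 come from the text; +/-0.80 are hypothetical stand-ins
for target in (0.20, 0.50, 0.80, -0.80):
    x, y = simulate_bivariate_normal(500, 2.0, 3.0, target, seed=42)
    print(f"target {target:+.2f}: r = {pearson_r(x, y):+.3f}, "
          f"rho = {spearman_rho(x, y):+.3f}")
```

The sample coefficients will differ somewhat from the target values because of sampling variation, which is also why scatterplots of low correlations show no obvious trend.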
The strength of the correlation between x and y in [Figure 4] and [Figure 5] is the same, but the direction is opposite. In [Figure 4], when x increases, y also increases, whereas in [Figure 5], when x increases, y decreases.

Interpretation of the size of correlation coefficient

The value of the correlation coefficient may be interpreted on a scale from negligible to high positive/negative, as shown in [Table 1] (Hinkle et al., 2003).{Table 1}

Coefficient of Correlation (r) and Coefficient of Determination (R²)

The coefficient of correlation (r) is the degree of relationship between two variables, i.e., x and y, whereas the coefficient of determination (R²) shows the percentage of variation in y that is explained by all the x variables together. The value of “r” may vary from −1 to +1, whereas the value of “r²” lies between 0 and +1.

Use of Correlation Analysis in Biological Data

In biological research, correlation analysis is used to understand the relation between the independent variables (or risk factors) and the dependent variable (or disease outcome). The selected variables may be continuous or ordinal. For example, to examine the relation between systolic blood pressure (SBP) (continuous dependent) and risk factors/independent variables such as age (continuous) and weight (continuous), Pearson's correlation analysis would be used. On the other hand, to understand the relation between maternal age (continuous) and parity (ordinal), or between number of hospitalizations (ordinal) and history of stroke (ordinal), Spearman's correlation analysis would be used.

How to Perform Correlation in SPSS?

Correlation can be tested using the SPSS statistical software (IBM SPSS Statistics for Windows, IBM Corp., Released 2011, Version 20.0, Armonk, NY, USA) by following a short sequence of menu steps.
The procedure is as follows [Table 1], [Table 2], [Table 3], [Table 4].{Table 2}{Table 3}{Table 4}

Click Analyze > Correlate > Bivariate > select variables > select correlation coefficient > select test of significance (keep two tailed) > flag significant correlations (box checked) > OK (enter).

Example 1: Data (n = 967) on waist circumference (WC) and SBP were collected, and bivariate correlation was tested to understand the relation between the two. Since both selected variables are continuous, bivariate correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumption for both variables. The Pearson's correlation coefficient, i.e., r = 0.395, P < 0.001 [Table 2], implies that a low positive, yet statistically significant, linear relation is present between WC and SBP. The coefficient of determination, i.e., R², is 0.156 (0.395²), which implies that WC accounts for only 15.6% of the variation in SBP.

Example 2: Data (n = 936) on WC and body mass index (BMI) status were collected. BMI status was categorized into underweight, normal, overweight, and obese. Bivariate correlation was tested to understand the relation between the two. Since one of the selected variables is continuous (WC) while the other is ordinal (BMI status), bivariate correlation analysis is performed using Spearman's correlation coefficient, which does not require the normality assumption. The Spearman's correlation coefficient, i.e., ρ = 0.398, P < 0.001 [Table 3], implies that a low positive, yet statistically significant, monotonic relation is present between WC and BMI status. The coefficient of determination, i.e., R², is 0.158 (0.398²), which implies that BMI status explains 15.8% of the variation in WC.

Correlation analysis can also be used to calculate the independent correlation between two variables after adjusting for the effect of other variables.
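For a single controlling variable z, the adjusted (first-order partial) correlation between x and y can be obtained from the three pairwise Pearson coefficients using the standard formula r_xy.z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)). The sketch below illustrates this in standard-library Python. It is an illustration only: the function names and all data values are hypothetical, and it handles one covariate, whereas controlling for several covariates at once requires applying the formula recursively or working with regression residuals.

```python
import math


def pearson_r(x, y):
    """Sample Pearson's r from mean-centred cross products."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for one variable z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


# Hypothetical illustrative values (assumption, not data from the article)
wc_cm = [78, 82, 85, 90, 94, 99, 103, 108]
sbp_mmhg = [114, 118, 116, 124, 127, 126, 134, 138]
age_years = [28, 35, 31, 44, 47, 52, 50, 61]

print(f"unadjusted r(WC, SBP) = {pearson_r(wc_cm, sbp_mmhg):.3f}")
print(f"partial r(WC, SBP | age) = {partial_r(wc_cm, sbp_mmhg, age_years):.3f}")
```

Comparing the unadjusted and partial coefficients shows how much of the apparent association is attributable to the controlling variable.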
Such analysis can be done using partial correlation analysis in SPSS. The following command is given:

Click Analyze > Correlate > Partial > select variables > select controlling for (variables) > select test of significance (keep two tailed) > flag significant correlations (box checked) > OK (enter).

Example 3: Data (n = 940) on WC and SBP were collected, and partial correlation was tested to understand the relation between the two while controlling for confounding factors such as smoking status and education. Since both selected variables are continuous, partial correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumption for both variables. The Pearson's correlation coefficient, i.e., r = 0.381, P < 0.001 [Table 4], implies that a low positive, yet statistically significant, linear relation is present between WC and SBP after controlling for the effect of the confounders, i.e., smoking and education.

How to Perform Correlation Online

Simplified calculations for correlation analysis can also be performed online using the link: http://www.socscistatistics.com/tests/pearson/default2.aspx

Example 4: Consider continuous data on weight (n = 15) and WC (n = 15). Calculate the correlation between the selected variables. Enter the data for variable X, i.e., weight, in the designated column and WC in column Y [Figure 6]a.{Figure 6} Click on the tab “Calculate R” and the correlation graph will be obtained [Figure 6]b. The value of R is calculated using standard formulae [Figure 6]c. Note the value of R for the calculation of the P value. Then calculate the P value using the link: http://www.socscistatistics.com/pvalues/pearsondistribution.aspx for the value of R, i.e., 0.9876, and n = 15. The P value thus obtained is <0.00001.

This can also be done in Excel, as shown below for a data set on weight and blood sugar.
[Figure 7] shows the data, [Figure 8] is a scatter diagram that shows a strong positive correlation, and [Figure 9] shows the correlation calculations.{Figure 7}{Figure 8}{Figure 9}

Conclusion

Correlation is the technique for testing the strength of the linear relationship between two variables. It can be used for continuous or ordinal variables and can also assess the independent relation between the variables while controlling for the effect of confounders or other variables.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

References


