CURRICULUM IN CARDIOLOGY - STATISTICS
Year : 2018  |  Volume : 4  |  Issue : 2  |  Page : 116-121

Correlation analysis in biological studies


Department of Anthropology, University of Delhi, New Delhi, India

Date of Web Publication: 10-Sep-2018

Correspondence Address:
Suniti Yadav
Molecular Anthropology Laboratory, Department of Anthropology, University of Delhi, New Delhi - 110 007
India

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/jpcs.jpcs_31_18

Abstract

Correlation is a statistical procedure for testing the relationship between two quantitative or ordinal variables. In other words, it describes the degree of relation between two variables. It is one of the most commonly used statistical techniques. The present article is based on selected statistical textbooks, a review of the literature, and our own research experience.

Keywords: Correlation, parametric, risk factors-phenotype relation


How to cite this article:
Yadav S. Correlation analysis in biological studies. J Pract Cardiovasc Sci 2018;4:116-21

How to cite this URL:
Yadav S. Correlation analysis in biological studies. J Pract Cardiovasc Sci [serial online] 2018 [cited 2018 Dec 18];4:116-21. Available from: http://www.j-pcs.org/text.asp?2018/4/2/116/240962


Introduction


The concept of correlation was first proposed by Sir Francis Galton in 1894 and was later described mathematically by Karl Pearson in 1896.[1] Correlation analysis is a statistical method for evaluating the strength of the relationship between two numerically measurable, continuous variables.

In biostatistics, univariate statistical tests such as the Chi-square test, Fisher's exact test, t-test, and analysis of variance cannot account for the effect of other covariates/confounders during analysis.[2] However, a technique called partial correlation allows the researcher to control for the effect of confounders/covariates when examining the relation between two selected variables.[3] Partial correlation looks at the relationship between two variables after removing the effects of other variables.

In statistical terms, correlation is a method of assessing a possible two-way linear association between two measurable continuous variables. The extent of “correlation” is measured by a statistic called the correlation coefficient, which represents the strength of the putative linear association between the two selected variables. In other words, it is a statistic representing how closely two variables co-vary; it is a dimensionless quantity whose value can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).[4] A positive coefficient of correlation indicates that the variables are directly related, i.e., as the value of one variable increases, the value of the other variable also tends to increase. Conversely, a negative coefficient indicates that the selected variables are inversely related, i.e., as the value of one variable increases, the value of the other tends to decrease. In this strict statistical sense, a relation between two continuous variables that is not linear is not described as correlation.[5]

In biological research, the relation between the independent (predictor) variables and the dependent (outcome) variable is explored. This relation describes how the risk factors or predictor variables account for the occurrence of a disease or the presence of a phenotype. The disease outcome or dependent variable may be associated with biological factors (such as age and gender), lifestyle variables (such as physical activity, smoking, and alcohol consumption), physiological variables (such as blood pressure and pulse rate), and genetic factors (such as genetic mutations). To understand such “risk factors–disease” relationships, two techniques may be used, i.e., correlation and regression (Gaddis and Gaddis, 1990). Correlation provides a quantitative measure of the degree or strength of the relation between the selected variables, whereas regression describes this relation mathematically by predicting the value of the outcome from the value of the independent predictor.[6]


Types of Correlation


Pearson's r correlation

Pearson's correlation coefficient “r” is used when the data are normally distributed, i.e., when a “parametric” analysis is appropriate, and when the relationship between the variables is linear. Pearson's r is calculated using the following formula:

\[
r = \frac{N\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[N\sum x^{2} - \left(\sum x\right)^{2}\right]\left[N\sum y^{2} - \left(\sum y\right)^{2}\right]}}
\]

where r = Pearson's r correlation coefficient

N = number of observations

Σxy = sum of the products of paired scores

Σx = sum of x scores

Σy = sum of y scores

Σx² = sum of squared x scores

Σy² = sum of squared y scores.

For Pearson's r correlation, both variables should be normally distributed (bell-shaped distribution) and linearly related. Linearity means that the relationship between the two variables can be described by a straight line.
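
The raw-score formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `pearson_r` and the weight and waist values are hypothetical and are not taken from the data sets in this article.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed from the raw-score formula given in the text."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    sum_x = sum(x)                              # Σx
    sum_y = sum(y)                              # Σy
    sum_xy = sum(a * b for a, b in zip(x, y))   # Σxy
    sum_x2 = sum(a * a for a in x)              # Σx²
    sum_y2 = sum(b * b for b in y)              # Σy²
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical paired measurements (illustrative values only).
weight_kg = [52, 58, 61, 67, 73, 80]
waist_cm = [68, 74, 75, 82, 88, 95]
r = pearson_r(weight_kg, waist_cm)
print(f"r = {r:.3f}")
```

The function raises an error for a constant variable because the denominator is then zero and r is undefined.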

Spearman's rank correlation

Spearman's rank correlation is a nonparametric test used to measure the degree of association between two variables. It may be used when the selected variables are not normally distributed, i.e., when their distribution is “skewed.” This test makes no assumptions about the distribution of the data. It is best used when both variables are measured on a scale that is at least ordinal and the scores on one variable are monotonically related to those on the other.

Spearman's rank correlation is calculated using the following formula:

\[
\rho = 1 - \frac{6\sum d_i^{2}}{n\left(n^{2} - 1\right)}
\]

where ρ = Spearman's rank correlation

d_i = the difference between the two ranks of the i-th observation

n = number of observations.
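
As a sketch of how the rank-difference formula works, the following Python code ranks each variable and then applies the formula. The helper names `average_ranks` and `spearman_rho` are illustrative. Tied values are given the mean of their ranks, which is a common convention assumed here; the simple formula is exact only when there are no ties.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho from the rank-difference formula (exact without ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    n = len(x)
    rank_x = average_ranks(x)
    rank_y = average_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))  # Σd_i²
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# A monotonic but non-linear relation still gives a perfect rank correlation.
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 100]
print(spearman_rho(x, y))
```

When many ties are present, as with a variable that has only a few ordered categories, computing Pearson's r on the average ranks is the usual alternative to the simple formula.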


Statistical Simulations to Understand the Relationship between Correlation Coefficient and Scatterplots


A scatterplot of the selected variables displays their relationship visually. The stronger the correlation between the variables, in either direction, the closer the points lie to a straight line and the more obvious the trend in the scatterplot [Figure 1].
Figure 1: Correlation scatter plots between two variables.



For example, the data depicted in [Figure 2], [Figure 3], [Figure 4], [Figure 5] were simulated from a bivariate normal distribution, with 500 observations and means of 2 and 3 for the variables x and y, respectively (Figure source: Mukaka, 2012).
Figure 2: Scatterplot of variables x and y; Pearson's correlation = 0.20.
Figure 3: Scatterplot of variables x and y; Pearson's correlation = 0.50.
Figure 4: Scatterplot of variables x and y; Pearson's correlation = 0.80.
Figure 5: Scatterplot of variables x and y; Pearson's correlation = −0.80.


The scatterplot in [Figure 2] shows a linear trend between the variables x and y, but the trend is not clear because the coefficient of correlation is low, i.e., 0.20. The trend is more apparent in [Figure 3], where the coefficient of correlation is 0.50. [Figure 4] and [Figure 5] show that the higher the correlation in either direction, i.e., positive or negative, the more visible the linear association in the scatterplot. The strength of the correlation between x and y is the same in [Figure 4] and [Figure 5], but the direction is opposite. In [Figure 4], y increases as x increases, whereas in [Figure 5], y decreases as x increases.
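
A simulation of this kind can be sketched with the Python standard library. The sample size (500), the means (2 and 3), and the target correlations come from the text; unit variances, the seed, and the function names are assumptions made for illustration, since the original simulation settings are not given here.

```python
import math
import random


def simulate_bivariate_normal(n, mean_x, mean_y, rho, seed=None):
    """Draw n (x, y) pairs with unit variances (assumed) and correlation rho."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(mean_x + z1)
        # Mixing z1 and z2 in this way gives y unit variance and corr(x, y) = rho.
        ys.append(mean_y + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return xs, ys


def sample_r(xs, ys):
    """Pearson's r from deviations about the sample means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


for target in (0.20, 0.50, 0.80, -0.80):
    xs, ys = simulate_bivariate_normal(500, 2.0, 3.0, target, seed=1)
    print(f"target correlation = {target:+.2f}, sample r = {sample_r(xs, ys):+.3f}")
```

The sample r differs slightly from the target in each run because of sampling variation. Plotting `xs` against `ys` for each target would give scatterplots comparable to those described above.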

Interpretation of the size of correlation coefficient

The correlation coefficient value may be interpreted from negligible to high positive/negative as shown in [Table 1] (Hinkle et al., 2003).
Table 1: Size of correlation coefficient and its interpretation
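
Such a rule of thumb can be expressed as a small lookup function. The cut-offs below (0.90, 0.70, 0.50, and 0.30) are an assumption: they follow the rule of thumb commonly attributed to Hinkle et al., because the body of Table 1 is not reproduced in this text. The function name `interpret_correlation` is illustrative.

```python
def interpret_correlation(r):
    """Label the size of a correlation coefficient using assumed cut-offs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and +1")
    size = abs(r)
    # Assumed thresholds; check them against the table being followed.
    if size >= 0.90:
        label = "very high"
    elif size >= 0.70:
        label = "high"
    elif size >= 0.50:
        label = "moderate"
    elif size >= 0.30:
        label = "low"
    else:
        return "negligible correlation"
    direction = "positive" if r > 0 else "negative"
    return f"{label} {direction} correlation"


print(interpret_correlation(0.395))
print(interpret_correlation(-0.80))
```

Under these assumed cut-offs, a coefficient of 0.395 is labelled a low positive correlation, which matches the wording used in the worked examples in this article.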




Coefficient of Correlation (r) and Coefficient of Determination (R²)


The coefficient of correlation (r) measures the degree of linear relationship between two variables, x and y, whereas the coefficient of determination (R²) gives the proportion of the variation in y that is explained by x or, in a model with several predictors, by all the x variables together. For two variables, R² is simply the square of r. The value of r may vary from −1 to +1, whereas the value of R² lies between 0 and +1.


Use of Correlation Analysis in Biological Data


In biological research, correlation analysis is used to understand the relation between the independent variables (or risk factors) and the dependent variable (or the disease outcome). The selected variables may be continuous or ordinal. For example, to examine the relation between systolic blood pressure (SBP) (continuous dependent variable) and risk factors/independent variables such as age (continuous) and weight (continuous), Pearson's correlation analysis would be used. In contrast, to understand the relation between maternal age (continuous) and parity (ordinal), or between the number of hospitalizations (ordinal) and history of stroke (ordinal), Spearman's correlation analysis would be used.


How to Perform Correlation in SPSS?


Correlation analysis can be performed using the SPSS statistical software (IBM SPSS Statistics for Windows, IBM Corp., Released 2011, Version 20.0, Armonk, NY, USA). The procedure is described below, and the corresponding output is shown in [Table 2], [Table 3], [Table 4].
Table 2: Bivariate (Pearson) correlation analysis between systolic blood pressure and waist circumference
Table 3: Bivariate (Spearman) correlation analysis between systolic blood pressure and waist circumference
Table 4: Partial correlation between systolic blood pressure and waist circumference controlling for smoking and education


Click Analyze > Correlate > Bivariate > select variables > select correlation coefficient > select test of significance (keep two tailed) > flag significant correlations (box checked) > OK (enter).

Example 1: Data (n = 967) on waist circumference (WC) and SBP were collected, and bivariate correlation was tested to understand the relation between the two.

Since both the selected variables are continuous, bivariate correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumptions for both variables. The Pearson's correlation coefficient, r = 0.395, P < 0.001 [Table 2], indicates a low positive, yet statistically significant, linear relation between WC and SBP. The coefficient of determination, R², is 0.156 (0.395²), which implies that WC accounts for only 15.6% of the variation in SBP.
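
The arithmetic behind this example can be checked with a short Python sketch. The values r = 0.395 and n = 967 come from the example; the function names are illustrative. The significance test shown is the standard t statistic for a correlation coefficient with n − 2 degrees of freedom, and the P value uses a normal approximation (an assumption that is reasonable only for large samples), so it will not exactly match the value reported by SPSS.

```python
import math


def coefficient_of_determination(r):
    """Square of the correlation coefficient."""
    return r ** 2


def t_statistic(r, n):
    """t statistic for testing a zero correlation, with n - 2 degrees of freedom."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


def approx_two_tailed_p(t):
    """Two-tailed P value from the normal approximation (large n only)."""
    return math.erfc(abs(t) / math.sqrt(2))


r, n = 0.395, 967
r2 = coefficient_of_determination(r)
t = t_statistic(r, n)
p = approx_two_tailed_p(t)
print(f"R^2 = {r2:.3f}, t = {t:.2f}, P < 0.001: {p < 0.001}")
```

The large t statistic shows why a correlation that explains only about 16% of the variation can still be highly significant when the sample is large.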

Example 2: Data (n = 936) on WC and body mass index (BMI) status were collected. BMI status was categorized as underweight, normal, overweight, or obese. Bivariate correlation was tested to understand the relation between the two.

Since one of the selected variables is continuous (WC) while the other is ordinal (BMI status), bivariate correlation analysis is performed using Spearman's correlation coefficient, which does not require the variables to be normally distributed. The Spearman's correlation coefficient, ρ = 0.398, P < 0.001 [Table 3], indicates a low positive, yet statistically significant, monotonic relation between WC and BMI status. The squared coefficient is 0.158 (0.398²), which suggests that BMI status explains about 15.8% of the variation in WC when both variables are expressed as ranks.

Correlation analysis can also be used for calculating independent correlation between variables adjusting for the effect of other variables. Such analysis can be done using partial correlation analysis in SPSS. The following command is given:

Click Analyze > Correlate > Partial > select variables > select controlling for (variables) > select test of significance (keep two tailed) > flag significant correlations (box checked) > OK (enter).

Example 3: Data (n = 940) on WC and SBP were collected, and partial correlation was tested to understand the relation between the two while controlling for confounding factors such as smoking status and education.

Since both the selected variables are continuous, partial correlation analysis is performed using Pearson's correlation coefficient after checking the normality assumptions for both variables. The partial correlation coefficient, r = 0.381, P < 0.001 [Table 4], indicates a low positive, yet statistically significant, linear relation between WC and SBP after controlling for the effect of the confounders, i.e., smoking and education.
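
The idea behind partial correlation can be sketched with the standard recursive formula, which removes one control variable at a time. This is an illustration under stated assumptions and not the SPSS procedure itself: the function names and the eight records below are hypothetical, and smoking is coded 0/1 purely for the example.

```python
import math


def pearson_r(x, y):
    """Pearson's r from deviations about the sample means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, controls):
    """Correlation of x and y after removing linear effects of the controls."""
    if not controls:
        return pearson_r(x, y)
    z, rest = controls[0], controls[1:]
    # Each term is itself a partial correlation given the remaining controls.
    r_xy = partial_r(x, y, rest)
    r_xz = partial_r(x, z, rest)
    r_yz = partial_r(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical records (illustrative values only).
wc_cm = [78, 85, 92, 88, 101, 95, 110, 83]
sbp_mmhg = [112, 121, 130, 118, 142, 135, 150, 116]
smoker = [0, 0, 1, 0, 1, 1, 1, 0]
education_years = [16, 12, 10, 14, 8, 12, 9, 15]

unadjusted = pearson_r(wc_cm, sbp_mmhg)
adjusted = partial_r(wc_cm, sbp_mmhg, [smoker, education_years])
print(f"unadjusted r = {unadjusted:.3f}, partial r = {adjusted:.3f}")
```

Comparing the unadjusted and partial coefficients shows how much of the apparent association is shared with the control variables.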


How to Perform Correlation Online


Simplified calculations for correlation analysis can also be performed online using the link: http://www.socscistatistics.com/tests/pearson/default2.aspx

Example 4: Consider the continuous data on weight (n = 15) and WC (n = 15). Calculate the correlation between the selected variables.

Enter the data for variable X, i.e., weight in the designated column and WC in column Y [Figure 6]a.
Figure 6: (a) Data entered for variable X (weight in kg). (b) Correlation graph between the two variables X (Weight in kilograms) and Y (waist circumference in cm). (c) Calculation of R.



Click on the tab “Calculate R” and the correlation graph would be obtained [Figure 6]b. The value of R would be calculated using standard formulae [Figure 6]c.

Note the value of R for the calculation of P value. Further, calculate the P value using the link: http://www.socscistatistics.com/pvalues/pearsondistribution.aspx for the value of R, i.e., 0.9876 and n = 15. The P value thus obtained is <0.00001.

This can also be done in Excel as shown below:

The example data set relates weight to blood sugar. [Figure 7] shows the data, [Figure 8] is a scatter diagram showing a strong positive correlation, and [Figure 9] shows the correlation calculation.
Figure 7: Data set for Excel
Figure 8: Creating a scatter plot in Excel. Choose: INSERT > SCATTER > select the data set.
Figure 9: Calculating the correlation coefficient. Enter the formula =PEARSON(array1, array2) in any cell.



Conclusion


Correlation is the technique for testing the strength of the linear relationship between two variables. It can be used for continuous or ordinal variables and, through partial correlation, can also assess the independent relation between two variables while controlling for the effect of confounders or other variables.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

 
References

1. Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philos Trans R Soc Lond A 1896;187:253-318.
    
2. Chan YH. Biostatistics 201: Linear regression analysis. Singapore Med J 2004;45:55-61.
    
3. Chan YH. Biostatistics 103: Qualitative data-tests of independence. Singapore Med J 2003;44:498-503.

4. Swinscow TDV, Campbell MJ. Statistics at square one. London: BMJ; 2002. p. 111-25.

5. Mukaka MM. A guide to appropriate use of correlation coefficient in medical research. Malawi Med J 2012;24:69-71.

6. Hinkle DE, Wiersma W, Jurs SG. Applied Statistics for the Behavioral Sciences. 5th ed. Boston: Houghton Mifflin; 2003.
    

