CURRICULUM IN CARDIOLOGY - STATISTICS
Year : 2018  |  Volume : 4  |  Issue : 1  |  Page : 33-36

Linear regression analysis study


Department of Anthropology, University of Delhi, New Delhi, India

Date of Web Publication: 4-May-2018

Correspondence Address:
Khushbu Kumari
Department of Anthropology, University of Delhi, New Delhi
India

Source of Support: None, Conflict of Interest: None


DOI: 10.4103/jpcs.jpcs_8_18

  Abstract 

Linear regression is a statistical procedure for predicting the value of a dependent variable from an independent variable. It measures the association between two variables and is a modeling technique in which a dependent variable is predicted from one or more independent variables. Linear regression analysis is the most widely used of all statistical techniques. This article explains the basic concepts and shows how to perform linear regression calculations in SPSS and Excel.

Keywords: Continuous variable test, Excel and SPSS analysis, linear regression


How to cite this article:
Kumari K, Yadav S. Linear regression analysis study. J Pract Cardiovasc Sci 2018;4:33-6

How to cite this URL:
Kumari K, Yadav S. Linear regression analysis study. J Pract Cardiovasc Sci [serial online] 2018 [cited 2019 Dec 16];4:33-6. Available from: http://www.j-pcs.org/text.asp?2018/4/1/33/231939


  Introduction


The concept of linear regression was first proposed by Sir Francis Galton in 1894. Linear regression is a statistical test applied to a data set to define and quantify the relation between the considered variables. Univariate statistical tests such as Chi-square, Fisher's exact test, t-test, and analysis of variance (ANOVA) do not allow taking into account the effect of other covariates/confounders during analyses (Chan 2004). However, partial correlation and regression are tests that allow the researcher to control for the effect of confounders when studying the relation between two variables (Chan 2003).

In biomedical or clinical research, the researcher often tries to use one or more independent (predictor) variables to predict an outcome or dependent variable, i.e., to understand how risk factors (predictor or independent variables) account for the chance of a disease occurring (the dependent variable). Risk factors (independent variables) may be biological (such as age and sex), physical (such as body mass index and blood pressure [BP]), or lifestyle (such as smoking and alcohol consumption) variables associated with the disease. Both correlation and regression provide this opportunity to understand the "risk factor–disease" relationship (Gaddis and Gaddis 1990). While correlation provides a quantitative way of measuring the degree or strength of a relation between two variables, regression analysis describes this relationship mathematically: it allows predicting the value of a dependent variable from the value of at least one independent variable.

In correlation analysis, the correlation coefficient "r" is a dimensionless number whose value ranges from −1 to +1. A value toward −1 indicates an inverse (negative) relationship, whereas a value toward +1 indicates a positive one. When the data are normally distributed, Pearson's correlation is used; for nonnormally distributed data, Spearman's rank correlation is used.
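The distinction between the two coefficients can be illustrated with a short sketch in pure Python. The data values here are invented for illustration, and this minimal version of Spearman's rank correlation does not average tied ranks:

```python
def pearson_r(x, y):
    # Pearson's r: covariance of x and y divided by the product of their SDs
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    # Spearman's rho is simply Pearson's r computed on the ranks of the data
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]   # hypothetical predictor values
y = [2, 4, 5, 7, 6]   # hypothetical outcome values
print(round(pearson_r(x, y), 3))    # 0.904
print(round(spearman_rho(x, y), 3)) # 0.9
```

Because Spearman's rho works on ranks, it is unaffected by monotone transformations of the data, which is why it is preferred for nonnormally distributed variables.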

Linear regression analysis uses the mathematical equation y = mx + c, which describes the line of best fit for the relationship between y (the dependent variable) and x (the independent variable). The coefficient of determination, r², gives the proportion of the variability in y accounted for by x.[1],[2],[3],[4],[5],[6],[7],[8]
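The line of best fit y = mx + c is obtained by ordinary least squares. A minimal sketch, with made-up points chosen to lie exactly on y = 2x + 1:

```python
def fit_line(x, y):
    # Ordinary least squares: slope m = Sxy / Sxx, intercept c = mean(y) - m * mean(x)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    m = sxy / sxx
    c = my - m * mx
    return m, c

m, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 2x + 1
print(m, c)  # 2.0 1.0
```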


  Significance of Linear Regression


The use of linear regression model is important for the following reasons:

  1. Descriptive – It helps in analyzing the strength of the association between the outcome (dependent variable) and predictor variables
  2. Adjustment – It adjusts for the effect of covariates or the confounders
  3. Predictors – It helps in estimating the important risk factors that affect the dependent variable
  4. Extent of prediction – It helps in estimating how much the dependent variable changes when the independent variable changes by one "unit"
  5. Prediction – It helps in predicting outcome values for new cases.



  Assumptions for Linear Regression


The underlying assumptions for linear regression are:

  1. The values of independent variable “x” are set by the researcher
  2. The independent variable “x” should be measured without any experimental error
  3. For each value of “x,” there is a subpopulation of “y” variables that are normally distributed up and down the Y-axis [Figure 1]
  4. The variances of the subpopulations of “y” are homogeneous
  5. The mean values of the subpopulations of “y” lie on a straight line, thus implying the assumption that there exists a linear relation between the dependent and the independent variables
  6. All the values of “y” are independent from each other, though dependent on “x.”
Figure 1: Scatter plot of systolic blood pressure versus age.




  Coefficient of Determination, R²


The coefficient of determination is the portion of the total variation in the dependent variable that can be explained by variation in the independent variable(s). When R² = 1, there is a perfect linear relationship between x and y, i.e., 100% of the variation in y is explained by variation in x. When 0 < R² < 1, the linear relationship is weaker: some, but not all, of the variation in y is explained by variation in x.
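Numerically, R² is one minus the ratio of the residual sum of squares to the total sum of squares. A sketch with hypothetical data, assuming a simple least-squares fit:

```python
def r_squared(x, y):
    # Fit the least-squares line, then R² = 1 - SS_res / SS_tot
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    m = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    c = my - m * mx
    ss_res = sum((b - (m * a + c)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))            # 1.0 (perfect fit)
print(round(r_squared([1, 2, 3, 4], [3, 5, 6, 9]), 3))  # 0.963
```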


  Linear Regression in Biological Data Analysis


In biological or medical data, linear regression is often used to describe the relationship between two variables, or among several variables, through statistical estimation. For example, to know whether the likelihood of having high systolic BP (SBP) is influenced by factors such as age and weight, linear regression would be used. The variable to be explained (SBP) is called the dependent or response variable; the variables that explain it (age, weight, and sex) are called the independent or predictor variables.


  How to Calculate Linear Regression?


Linear regression can be performed with the SPSS statistical software (IBM Corp. Released 2011. IBM SPSS Statistics for Windows, Version 20.0. Armonk, NY: IBM Corp.) in five steps. The procedure is as follows [Table 1], [Table 2], [Table 3], [Table 4]:
Table 1: SPSS table
Table 2: SPSS output with R²
Table 3: Analysis of variance with P
Table 4: SPSS equation variables


Click Analyze > Regression > Linear > then select Dependent and Independent variable > OK (enter).

Example 1 – Data (n = 55) on age and SBP were collected, and a linear regression model was tested to predict SBP from age. After checking the normality assumptions for both variables, the bivariate correlation was tested (Pearson's correlation = 0.696, P < 0.001); a graphical scatter plot is helpful in this case [Figure 1].
Figure 2: Starting Data Analysis ToolPak. Click the OFFICE button and choose Excel options.



Now, to check the linear regression, put SBP as the dependent and age as the independent variable.

This indicates the dependent and independent variables included in the test.

Pearson's correlation between SBP and age is given (r = 0.696). R² = 0.485, which implies that only 48.5% of the variation in SBP is explained by a person's age.

The ANOVA table shows the “usefulness” of the linear regression model with P < 0.05.

This provides the quantification of the relationship between age and SBP. With every increase of 1 year in age, the SBP (on average) increases by 1.051 (95% confidence interval 0.752–1.350) units, P < 0.001. The constant here has no "practical" meaning, as it gives the value of the SBP when age = 0.
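The slope and its 95% confidence interval can be sketched in pure Python. The age/SBP pairs below are invented for illustration (they are not the article's n = 55 data set), and tcrit = 2.0 is a rough stand-in for the exact t critical value:

```python
import math

def slope_with_ci(x, y, tcrit=2.0):
    # Least-squares slope with an approximate 95% CI:
    # SE(m) = sqrt(SS_res / (n - 2) / Sxx); tcrit = 2.0 roughly
    # approximates the t critical value for moderate sample sizes.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    m = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    c = my - m * mx
    ss_res = sum((b - (m * a + c)) ** 2 for a, b in zip(x, y))
    se = math.sqrt(ss_res / (n - 2) / sxx)
    return m, m - tcrit * se, m + tcrit * se

# Hypothetical age/SBP pairs, not the article's data
age = [30, 40, 50, 60, 70]
sbp = [118, 128, 135, 145, 152]
m, lo, hi = slope_with_ci(age, sbp)
print(round(m, 2), round(lo, 2), round(hi, 2))  # 0.85 0.79 0.91
```

Read the same way as the SPSS output: each extra year of age is associated with an increase of about 0.85 units of SBP, with the interval quantifying the uncertainty of that slope.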

Further, if more than one independent variable is added, the linear regression model adjusts for the effect of the other independent variables when testing the effect of each variable.

Example 2 – If we want to see the genetic effect of variables, i.e., the effect of an increase of one allele dose of a genetic variant (mutation) on the disease or phenotype, linear regression is used in the same way as described above. The three genotypes, i.e., normal homozygote AA, heterozygote AB, and homozygote mutant BB, may be coded as 1, 2, and 3, respectively. The test may then proceed, and in the same way, the unstandardized coefficient (β) gives the effect on the dependent variable per additional allele.
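With the genotypes coded 1, 2, and 3 as above, the per-allele effect is just the least-squares slope of phenotype on genotype code. A sketch with invented phenotype values:

```python
# Genotypes coded additively as in the text: AA = 1, AB = 2, BB = 3
geno = [1, 1, 2, 2, 3, 3]
pheno = [5.0, 5.4, 6.1, 6.5, 7.2, 7.0]  # hypothetical trait values

n = len(geno)
mg, mp = sum(geno) / n, sum(pheno) / n
# Unstandardized coefficient (beta): change in phenotype per extra allele
beta = sum((g - mg) * (p - mp) for g, p in zip(geno, pheno)) \
       / sum((g - mg) ** 2 for g in geno)
print(round(beta, 2))  # 0.95
```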

Example 3 – Using Excel to examine the relationship of medicine sales to the price of the medicine and TV advertisements.

[Table 5] contains data which can be entered into an Excel sheet. Follow instructions as shown in [Figure 2], [Figure 3], [Figure 4].
Table 5: Excel data set
Figure 3: The ToolPak. Choose Add-Ins > Choose Analysis ToolPak and select Go.
Figure 4: The regression screen. Choose Data > Data Analysis > Regression. Input Y Range: A1:A8. Input X Range: B1:C8. Check Labels, Residuals, Output Range as A50.


As shown in [Table 6], Multiple R is the correlation coefficient, where 1 means a perfect correlation and zero means none. R Square is the coefficient of determination, which here means that 92% of the variation can be explained by the variables. Adjusted R Square adjusts for multiple variables and should be used here. [Table 7] shows how to create a linear regression equation from the data.
Table 6: Summary output
Table 7: Analysis of variance
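The adjusted R Square penalizes R² for the number of predictors. A sketch using the values reported in the text, with n = 7 observations inferred from the Y range A1:A8 (one row of labels) and k = 2 predictors (price and TV advertisements):

```python
def adjusted_r2(r2, n, k):
    # Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)
    # for n observations and k predictors
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.92, 7, 2), 2))  # 0.88
```

With few observations and several predictors the adjustment is substantial, which is why the adjusted value, not the raw R², should be reported for multiple regression.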



  Conclusion


The techniques for testing the relationship between two variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation. In this article, we have used simple examples, together with SPSS and Excel, to illustrate linear regression analysis, and we encourage readers to analyze their data with these techniques.

Financial support and sponsorship

Nil.

Conflicts of interest

There are no conflicts of interest.

 
  References

1. Schneider A, Hommel G, Blettner M. Linear regression analysis: Part 14 of a series on evaluation of scientific publications. Dtsch Arztebl Int 2010;107:776-82.
2. Freedman DA. Statistical Models: Theory and Practice. Cambridge, USA: Cambridge University Press; 2009.
3. Chan YH. Biostatistics 201: Linear regression analysis. Singapore Med J 2004;45:55-61.
4. Chan YH. Biostatistics 103: Qualitative data – Tests of independence. Singapore Med J 2003;44:498-503.
5. Gaddis ML, Gaddis GM. Introduction to biostatistics: Part 6, correlation and regression. Ann Emerg Med 1990;19:1462-8.
6. Mendenhall W, Sincich T. Statistics for Engineering and the Sciences. 3rd ed. New York: Dellen Publishing Co.; 1992.
7. Panchenko D. 18.443 Statistics for Applications, Section 14, Simple Linear Regression. Massachusetts Institute of Technology: MIT OpenCourseWare; 2006.
8. Pedhazur EJ. Multiple Regression in Behavioral Research: Explanation and Prediction. 2nd ed. New York: Holt, Rinehart and Winston; 1982.

