REVIEW ARTICLE
Year: 2019  Volume: 5  Issue: 1  Page: 12-13
Machine learning and medical research data analysis
Rajiv Narang^{1}, Jaya Deva^{2}, Sada Nand Dwivedi^{3}, ^{1} Department of Cardiology, All India Institute of Medical Sciences, New Delhi, India ^{2} Department of Electrical Engineering, Indian Institute of Technology, New Delhi, India ^{3} Department of Biostatistics, All India Institute of Medical Sciences, New Delhi, India
Correspondence Address:
Dr. Rajiv Narang, Department of Cardiology, All India Institute of Medical Sciences, Ansari Nagar, New Delhi - 110 029, India
How to cite this article:
Narang R, Deva J, Dwivedi SN. Machine learning and medical research data analysis. J Pract Cardiovasc Sci 2019;5:12-13

How to cite this URL:
Narang R, Deva J, Dwivedi SN. Machine learning and medical research data analysis. J Pract Cardiovasc Sci [serial online] 2019 [cited 2019 Jul 19];5:12-13
Available from: http://www.jpcs.org/text.asp?2019/5/1/12/257597 
Developments over the last few years may change (statistically significantly!) the way we analyze our data. These include the wide availability of powerful computers (especially those with graphics processing units, or GPUs, that allow large-scale, parallelized computations), open-source programming languages such as R (https://cran.r-project.org) and Python (https://www.python.org), and machine learning (ML) software (e.g., scikit-learn, Theano, TensorFlow, Caffe, Weka, and Apache Spark). As a result, ML is increasingly being used for data analysis in medicine.[1],[2],[3]
ML algorithms can be supervised or unsupervised, depending on whether a class or outcome variable is available. In addition to the commonly used linear and logistic regression, many other linear models are available, such as Ridge regression, Lasso, Elastic Net, Least Angle Regression, Bayesian regression, Perceptron, Random Sample Consensus (RANSAC), the Theil–Sen estimator, and Huber regression. The last three have the advantage of being robust to outliers. Currently, only linear and logistic regression analyses are widely used in medical studies.
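The robustness of the last three estimators can be illustrated with a minimal sketch in Python using scikit-learn. The data below are synthetic and purely an assumption for illustration (a true slope of 2, with 10% of the cases at the highest x values shifted far below the line); they are not taken from any study cited here.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)

# Synthetic data (assumption): y = 2x + 1 plus Gaussian noise.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=n)

# Contaminate the 20 cases with the largest x by shifting them far below the line.
outlier_idx = np.argsort(x)[-20:]
y[outlier_idx] -= 40.0
X = x.reshape(-1, 1)

models = {
    "ols": LinearRegression(),
    "huber": HuberRegressor(),
    "theil_sen": TheilSenRegressor(random_state=0),
    "ransac": RANSACRegressor(random_state=0),
}

slopes = {}
for name, model in models.items():
    model.fit(X, y)
    # RANSAC stores the final model, refitted on the inliers, in `estimator_`.
    fitted = model.estimator_ if name == "ransac" else model
    slopes[name] = float(fitted.coef_[0])
    print(f"{name}: estimated slope = {slopes[name]:.2f} (true slope = 2.00)")
```

In this contrived setting, ordinary least squares is pulled strongly toward the contaminated cases, whereas the robust estimators stay much closer to the underlying slope. How large the difference is in real data depends on the number and position of the outliers.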
Supervised learning also includes many other techniques, several of them nonlinear, such as Linear and Quadratic Discriminant Analysis, Kernel ridge regression, Support vector machines, models trained by Stochastic gradient descent, Nearest Neighbors, Gaussian processes, Cross decomposition, Naive Bayes (e.g., Bernoulli, Gaussian, and Multinomial), Decision trees (Decision tree, Extra tree), Ensemble methods (including Bagging, Random Forest, AdaBoost, Gradient Tree Boosting, and Voting classifier), and supervised neural network models (e.g., multilayer perceptron). Cross decomposition techniques, including partial least squares and canonical correlation analysis, can find relationships between two matrices and hence can be used when the group or outcome variable is itself multivariate, like the predictor variables. Most of the techniques mentioned above can be used for regression (as an alternative to linear regression) and for classification (as an alternative to logistic regression).
Unsupervised ML methods are helpful in dimensionality reduction and in analyzing data with a large number of features but a small number of cases, a scenario in which linear and logistic regression techniques are not reliable. They include decompositions (e.g., principal component analysis [PCA] and its variants such as kernel PCA, iterative PCA, robust PCA, sparse PCA, and weighted PCA, as well as entropy component analysis, truncated singular value decomposition, non-negative matrix factorization, independent component analysis, and factor analysis), Gaussian mixture models, manifold learning (e.g., Isomap, locally linear embedding, nonlinear spectral embedding, and multidimensional scaling), clustering (e.g., k-means, spectral clustering, hierarchical clustering, DBSCAN, and BIRCH), covariance estimation, density estimation, and unsupervised neural network models (restricted Boltzmann machines).
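The "many features, few cases" scenario can be sketched with PCA in scikit-learn. The data below are synthetic (an assumption): 30 cases and 200 features generated from three hypothetical latent factors plus a small amount of noise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 30 cases, 200 features driven by 3 latent factors.
rng = np.random.default_rng(0)
n_cases, n_features, n_latent = 30, 200, 3
latent = rng.normal(size=(n_cases, n_latent))
loadings = rng.normal(size=(n_latent, n_features))
X = latent @ loadings + 0.1 * rng.normal(size=(n_cases, n_features))

# Standardize the features, then project onto the first 5 principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=5).fit(X_std)
scores = pca.transform(X_std)  # 30 cases x 5 components
explained = pca.explained_variance_ratio_

for i, ratio in enumerate(explained, start=1):
    print(f"Component {i}: {100 * ratio:.1f}% of variance")
```

Because the 200 features here are driven by only three underlying factors, the first three components account for most of the variance, and the reduced scores could then be passed to a regression or classification model in place of the original features.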
Many of these algorithms are lengthy and were traditionally time-consuming, but they can now be run easily on fast modern computers. Each technique has its own assumptions, advantages, disadvantages, and situations in which it is most useful. These algorithms may produce models and results different from those of linear or logistic regression, but they may actually be closer to the truth. The average values of coefficients obtained from multiple algorithms are also likely to be indicative of true relationships. It would be prudent to make use of this variety of available techniques for medical research data analysis.
Several feature selection methods are also available to retain informative features while discarding those that contribute little, for example, features with low variance. These include sequential feature selection, minimum redundancy maximum relevance, correlation feature selection, regularized trees, Relief, and information gain-based feature selection, among others. They are broadly categorized into filter, wrapper, and embedded methods (the last of which incorporate both feature selection and learning). Individual feature importance can also be determined by many methods, such as Gradient Boosting, AdaBoost, Extra Trees, Decision Tree, and Random Forest. In addition, many techniques for model selection and evaluation are available. These include cross-validation, model persistence, and model curves. Comparison of different algorithms using cross-validation is especially popular.
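Comparison of algorithms by cross-validation can be sketched as follows in Python with scikit-learn. The breast cancer dataset bundled with scikit-learn is used only as a convenient stand-in (an assumption; it is not one of the datasets discussed in this article), and the three classifiers are an arbitrary selection.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset bundled with scikit-learn (assumption, for illustration only).
X, y = load_breast_cancer(return_X_y=True)

candidates = {
    # Scaling sits inside the pipeline so it is re-fitted on each training fold.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gaussian_nb": GaussianNB(),
}

# The same 5 stratified folds are used for every algorithm.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_accuracy = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_accuracy[name] = float(fold_scores.mean())
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} (SD {fold_scores.std():.3f})")
```

Using identical folds for every candidate keeps the comparison fair, and placing preprocessing inside the pipeline avoids leaking information from the held-out fold into model fitting.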
Unconventional approaches used by ML techniques may result in unexpected benefits. For example, Simjanoska et al. were able to accurately determine blood pressure from raw electrocardiographic data![4] Moreover, multiple techniques can now easily be applied to medical data.[5] For example, [Figure 1] shows the results of regression analysis of factors associated with low birth weight from the publicly available "birthwt" dataset using 20 different regression algorithms. Similarly, on running 13 classification algorithms for feature selection on the publicly available South African Heart Disease ("sahd") dataset, it was found that age, tobacco, low-density lipoprotein, family history, and Type A personality were selected by 92%, 85%, 77%, 69%, and 62% of the algorithms, respectively. Obesity, adiposity, systolic blood pressure, and alcohol intake were selected by only 23%, 15%, 8%, and 8% of the algorithms, respectively. A meta-analysis of this kind, involving results from different algorithms applied to the same data, is likely to produce answers as correct as a meta-analysis of data from different studies using the same single algorithm.{Figure 1}
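The idea of counting how many algorithms select each feature can be sketched as below. This is not the procedure used to obtain the percentages reported above, which is not described in detail here; it is one possible approach under stated assumptions. The data are synthetic (only the first three of eight features truly influence the outcome), five scikit-learn classifiers stand in for the 13 algorithms, and a feature counts as "selected" when its importance exceeds the mean importance for that model.

```python
import numpy as np
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): only features 0, 1 and 2 affect the outcome.
rng = np.random.default_rng(0)
n_cases, n_features = 500, 8
X = rng.normal(size=(n_cases, n_features))
logit = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2]
y = (rng.random(n_cases) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

estimators = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# One vote per algorithm for each feature whose importance (or absolute
# coefficient) is above the mean importance within that algorithm.
votes = np.zeros(n_features)
for name, estimator in estimators.items():
    selector = SelectFromModel(estimator, threshold="mean").fit(X, y)
    votes += selector.get_support()

percent_selected = 100.0 * votes / len(estimators)
for j, pct in enumerate(percent_selected):
    print(f"feature {j}: selected by {pct:.0f}% of algorithms")
```

The selection threshold is a design choice: a stricter or looser cut-off, or a different selection rule per algorithm, would change the percentages, so such vote counts are best read as a rough ranking rather than as exact quantities.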
Financial support and sponsorship
Nil.
Conflicts of interest
There are no conflicts of interest.
References
1  Baum A, Scarpa J, Bruzelius E, Tamler R, Basu S, Faghmous J. Targeting weight loss interventions to reduce cardiovascular complications of type 2 diabetes: A machine learning-based post hoc analysis of heterogeneous treatment effects in the Look AHEAD trial. Lancet Diabetes Endocrinol 2017;5:808-15.
2  Motwani M, Dey D, Berman DS, Germano G, Achenbach S, Al-Mallah MH, et al. Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: A 5-year multicentre prospective registry analysis. Eur Heart J 2017;38:500-7.
3  Ahmad T, Lund LH, Rao P, Ghosh R, Warier P, Vaccaro B, et al. Machine learning methods improve prognostication, identify clinically distinct phenotypes, and detect heterogeneity in response to therapy in a large cohort of heart failure patients. J Am Heart Assoc 2018;7. pii: e008081. 
4  Simjanoska M, Gjoreski M, Gams M, Madevska Bogdanova A. Noninvasive blood pressure estimation from ECG using machine learning techniques. Sensors (Basel) 2018;18. pii: E1160. 
5  Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, Keteyian S, et al. Using machine learning on cardiorespiratory fitness data for predicting hypertension: The Henry Ford ExercIse Testing (FIT) project. PLoS One 2018;13:e0195344. 
