{"id":661,"date":"2018-06-03T15:27:06","date_gmt":"2018-06-03T15:27:06","guid":{"rendered":"http:\/\/muthu.co\/?p=661"},"modified":"2021-05-24T03:47:04","modified_gmt":"2021-05-24T03:47:04","slug":"multiple-linear-regression-with-python-on-framingham-heart-study-data","status":"publish","type":"post","link":"http:\/\/write.muthu.co\/multiple-linear-regression-with-python-on-framingham-heart-study-data\/","title":{"rendered":"Multiple Linear Regression with Python on Framingham Heart Study data"},"content":{"rendered":"\n<p>Previously we built a <a href=\"http:\/\/muthu.co\/math-behind-linear-regression-and-python-code\/\">simple linear regression model<\/a> using a single explanatory variable to predict the price of a pizza from its diameter. In the real world, however, the price of a pizza cannot be derived from the diameter of its base alone. It also depends on the toppings, which means there are many more independent variables to use in the prediction equation. A regression that uses multiple explanatory variables is called multiple linear regression. 
A simple linear regression uses a single explanatory variable with a single coefficient, whereas a multiple linear regression uses one coefficient for each explanatory variable but still has a single dependent variable.<\/p>\n\n\n\n<p>The equation of multiple linear regression looks like:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/05\/Snip20180530_71.png\"><img loading=\"lazy\" decoding=\"async\" width=\"306\" height=\"58\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/05\/Snip20180530_71.png\" alt=\"\" class=\"wp-image-662\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/05\/Snip20180530_71.png 306w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/05\/Snip20180530_71-300x57.png 300w\" sizes=\"(max-width: 306px) 100vw, 306px\" \/><\/a><\/figure><\/div>\n\n\n\n<p>where <em>x<sub>1<\/sub>, x<sub>2<\/sub>, &#8230;, x<sub>n<\/sub><\/em> represent our explanatory variables and <em>y<\/em> is our dependent variable. The mathematics of multiple linear regression involves matrix operations, which I will explain in more detail in another post.<\/p>\n\n\n\n<p>In this post we will focus on building a multiple linear regression model using the&nbsp;<a href=\"https:\/\/en.wikipedia.org\/wiki\/Framingham_Heart_Study\"><b>Framingham Heart Study data<\/b><\/a>. The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study of residents of the town of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham, and is now on its third generation of participants. Much of the now-common knowledge concerning heart disease, such as the effects of diet, exercise, and common medications such as aspirin, is based on this study. One of the important observations from the Framingham Heart Study was the strong correlation between BMI and blood pressure. 
People with a higher BMI are at a higher risk of cardiovascular disease. Blood pressure also tends to rise with age. Let&#8217;s see if we can derive these observations using our regression model.<\/p>\n\n\n\n<p>First, let&#8217;s download the Framingham data from <a href=\"http:\/\/sphweb.bumc.bu.edu\/otlt\/MPH-Modules\/QuantCore\/PH717_MultipleVariableRegression\/fram1.csv\">here<\/a>. This dataset has a long list of columns, from which we will pick&nbsp;<strong><em>SYSBP, BMI, AGE, SEX<\/em><\/strong> and&nbsp;<strong><em>BPMEDS<\/em><\/strong>&nbsp;to build our model.<\/p>\n\n\n\n<p><em><strong>SYSBP<\/strong><\/em> &#8211; Systolic blood pressure is what we are trying to predict, so it becomes our dependent variable <em>Y<\/em>. The table below shows the different categories of blood pressure with their standard values.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_84.png\"><img loading=\"lazy\" decoding=\"async\" width=\"632\" height=\"419\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_84.png\" alt=\"SYS BP\" class=\"wp-image-670\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_84.png 632w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_84-300x199.png 300w\" sizes=\"(max-width: 632px) 100vw, 632px\" \/><\/a><figcaption>Source: www.heart.org<\/figcaption><\/figure><\/div>\n\n\n\n<p><em><strong>BMI, AGE, SEX<\/strong> and <strong>BPMEDS<\/strong><\/em>&nbsp;(an indicator for antihypertensive medication use) will be our four explanatory variables <em>X. 
The standard BMI categories are listed below:<\/em><\/p>\n\n\n\n<ul><li>Underweight: Your BMI is less than 18.5<\/li><li>Healthy weight: Your BMI is 18.5 to 24.9<\/li><li><a href=\"https:\/\/www.webmd.com\/diet\/obesity\/features\/am-i-obese\" data-metrics-link=\"\" data-crosslink-type=\"article\">Overweight<\/a>: Your BMI is 25 to 29.9<\/li><li><a href=\"https:\/\/www.webmd.com\/diet\/obesity\/video\/obesity-risks\" data-metrics-link=\"\" data-crosslink-type=\"\">Obese<\/a>: Your BMI is 30 or higher<\/li><\/ul>\n\n\n\n<div>Let&#8217;s write our Python program step by step:<\/div>\n\n\n\n<p><strong>1. Importing the data<\/strong>\u00a0&#8211; We will use the pandas library to import our data into a DataFrame. A short introduction to pandas can be found\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/version\/0.22.0\/10min.html\">here<\/a>.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# Importing the dataset\ndataset = pd.read_csv('fram1.csv')\n# Dropping the rows where BMI is missing\ndataset = dataset&#91;~dataset&#91;'BMI'].isnull()]\nX = dataset.iloc&#91;:, &#91;0,8,3,6]].values # select SEX, BMI, AGE, CURSMOKE\ny = dataset.iloc&#91;:, 4].values # select SYSBP<\/code><\/pre>\n\n\n\n<p>When I first tried to build the model, I found that a few rows have an empty BMI, so I modified the program to drop those rows, as you can see above. Note that the comment in the code labels the fourth selected column (index 6) as CURSMOKE rather than BPMEDS, so check the column positions against the header of your copy of the file.<\/p>\n\n\n\n<p><strong>2. Encoding the categorical data &#8211;&nbsp;<\/strong>The SEX field in the dataset uses the values 1 and 2 to indicate male and female. We encode it with a label encoder, which gives us 0 and 1 for male and female.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Encoding categorical data\nfrom sklearn.preprocessing import LabelEncoder\n\nlabelencoder = LabelEncoder()\nX&#91;:, 0] = labelencoder.fit_transform(X&#91;:, 0])<\/code><\/pre>\n\n\n\n<p><strong>3. 
Splitting the dataset into training and test sets<\/strong>&nbsp;&#8211; This is an important step in model building because it lets us assess the model&#8217;s predictive capabilities on data it was not trained on. We evaluate the model using R-squared, which measures how well the observed values of the response variable are predicted by the model. In Python we can simply use the&nbsp;<strong><em>score<\/em><\/strong>&nbsp;method of the fitted regressor to calculate R-squared. An R-squared score of 1 indicates that the response variable can be predicted without any error using the model. An R-squared score of one half indicates that half of the variance in the response variable is explained by the model.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Splitting the dataset into the Training set and Test set\nfrom sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)\n<\/code><\/pre>\n\n\n\n<p><strong>4. Fitting the multiple regression to the training set<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># Fitting Multiple Linear Regression to the Training set\nfrom sklearn.linear_model import LinearRegression\nregressor = LinearRegression()\nregressor.fit(X_train, y_train)\n\n# Predicting SYSBP for every row of X (used for the plot below)\ny_pred = regressor.predict(X)\n\n# R-squared on the held-out test set, then the fitted coefficients\nprint('R-squared: %.2f' % regressor.score(X_test, y_test))\nprint(regressor.coef_)<\/code><\/pre>\n\n\n\n<p>Outputs:<\/p>\n\n\n\n<ul><li>R-squared: 0.26<\/li><li>Regression coefficients: 2.51226344, 1.55951204, 0.92496226, -0.16286499<\/li><\/ul>\n\n\n\n<p>The coefficients are listed in the same order as the columns of <em>X<\/em>, so the second and third values (about 1.56 and 0.92) are the estimated changes in predicted systolic blood pressure per unit of BMI and per year of age, with the other variables held constant. An R-squared of 0.26 means the model explains roughly a quarter of the variance in systolic blood pressure on the test set.<\/p>\n\n\n\n<p><strong>5. 
Plotting the output on a scatter chart.<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import matplotlib.pyplot as plt\n\nfig, ax = plt.subplots()\nax.set_xticks(&#91;18.5, 24.9, 29.9], minor=False) # important values of BMI\nax.set_yticks(&#91;120, 130, 140, 180], minor=False) # important values of SysBP\nax.xaxis.grid(True, which='major', linewidth=0.5, color='red')\nax.yaxis.grid(True, which='major', linewidth=0.5, color='blue')\n\nplt.scatter(X&#91;:,1], y_pred, marker='.')\nplt.ylabel(\"Systolic blood pressure\")\nplt.xlabel(\"BMI\")\nplt.show()<\/code><\/pre>\n\n\n\n<p>In the above plot we added grid lines that mark the boundaries between the BMI and systolic blood pressure categories.<\/p>\n\n\n\n<p>The entire program is as below:<\/p>\n\n\n\n<pre class=\"wp-block-code EnlighterJSRAW\"><code># Multiple Linear Regression\n# Importing the libraries\nimport matplotlib.pyplot as plt\nimport pandas as pd\n\n# Importing the dataset\ndataset = pd.read_csv('fram1.csv')\ndataset = dataset&#91;~dataset&#91;'BMI'].isnull()] # dropping the rows where BMI is missing\nX = dataset.iloc&#91;:, &#91;0,8,3,6]].values # select SEX, BMI, AGE, CURSMOKE\ny = dataset.iloc&#91;:, 4].values # select SYSBP\n\n# Encoding categorical data\nfrom sklearn.preprocessing import LabelEncoder\nlabelencoder = LabelEncoder()\nX&#91;:, 0] = labelencoder.fit_transform(X&#91;:, 0])\n\n# Splitting the dataset into the Training set and Test set\nfrom sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)\n\n# Fitting Multiple Linear Regression to the Training set\nfrom sklearn.linear_model import LinearRegression\nregressor = LinearRegression()\nregressor.fit(X_train, y_train)\n\n# Predicting SYSBP for every row of X (used for the plot below)\ny_pred = regressor.predict(X)\n\n# R-squared on the held-out test set, then the fitted coefficients\nprint('R-squared: %.2f' % regressor.score(X_test, y_test))\nprint(regressor.coef_)\n\n\nfig, ax = plt.subplots()\nax.set_xticks(&#91;18.5, 24.9, 29.9], minor=False) 
# important values of BMI\nax.set_yticks(&#91;120, 130, 140, 180], minor=False) # important values of SysBP\nax.xaxis.grid(True, which='major', linewidth=0.5, color='red')\nax.yaxis.grid(True, which='major', linewidth=0.5, color='blue')\n\nplt.scatter(X&#91;:,1], y_pred, marker='.')\nplt.ylabel(\"Systolic blood pressure\")\nplt.xlabel(\"BMI\")\n\nplt.show()<\/code><\/pre>\n\n\n\n<p>We get the two plots below from our regression model. They show how BMI and age affect the predicted systolic blood pressure. The second plot can be produced by plotting the AGE column, X&#91;:,2], on the x-axis instead of BMI.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_86.png\"><img loading=\"lazy\" decoding=\"async\" width=\"633\" height=\"472\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_86.png\" alt=\"\" class=\"wp-image-673\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_86.png 633w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_86-300x224.png 300w\" sizes=\"(max-width: 633px) 100vw, 633px\" \/><\/a><figcaption>Growth of systolic blood pressure with respect to BMI. 
Blood pressure is higher for people with a higher BMI (overweight or obese).<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_87.png\"><img loading=\"lazy\" decoding=\"async\" width=\"634\" height=\"472\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_87.png\" alt=\"\" class=\"wp-image-674\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_87.png 634w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/06\/Snip20180603_87-300x223.png 300w\" sizes=\"(max-width: 634px) 100vw, 634px\" \/><\/a><figcaption>Growth of systolic blood pressure with respect to a person&#8217;s age. Blood pressure is higher in adults who are older than 50.<\/figcaption><\/figure><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Previously we built a simple linear regression model using a single explanatory variable to predict the price of pizza from its diameter. But in the real world the price of pizza cannot be entirely derived from the diameter of its base alone. 
It also depends on the toppings, which means there are many more [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37,32],"tags":[49,58],"_links":{"self":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/661"}],"collection":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/comments?post=661"}],"version-history":[{"count":3,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/661\/revisions"}],"predecessor-version":[{"id":1896,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/661\/revisions\/1896"}],"wp:attachment":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media?parent=661"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/categories?post=661"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/tags?post=661"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}