{"id":1750,"date":"2021-05-07T02:11:57","date_gmt":"2021-05-07T02:11:57","guid":{"rendered":"http:\/\/192.168.31.181\/muthu\/?p=1750"},"modified":"2021-05-09T02:50:43","modified_gmt":"2021-05-09T02:50:43","slug":"understanding-correlations-and-correlation-matrix","status":"publish","type":"post","link":"http:\/\/write.muthu.co\/understanding-correlations-and-correlation-matrix\/","title":{"rendered":"Understanding Correlations and Correlation Matrix"},"content":{"rendered":"\n
Correlation is a measure of how two or more variables are related to one another, also referred to as linear dependence. Examples are everywhere: an increase in demand for a product drives up its price (the demand curve), road traffic peaks at certain times of day, and the amount of rainfall correlates with grass fires. <\/p>\n\n\n\n
Correlation doesn’t imply causation<\/a>: even when two variables show a linear dependence, one should not assume that one is affecting the other without proper hypothesis testing. Correlation gives you an exploratory overview of any dependence between the variables in your dataset; causation can only be established after careful study. For example, it is a general observation that women who are more educated tend to have fewer children, while women who are less educated tend to have more. If you compare the populations of developed and under-developed countries against their national education indices, the two seem correlated, but we can’t conclude that education itself causes women to have fewer babies. So, correlation is best used as a suggestion rather than a technique that gives definitive answers. It is often a preparatory piece of analysis that gives some clues to what the data might yield, to be followed by other techniques such as regression. <\/p>\n\n\n\n Two variables X and Y are positively correlated if high values of X go with high values of Y and low values of X go with low values of Y. For example:<\/p>\n\n\n\n Two variables are said to be negatively correlated if high values of X go with low values of Y and vice versa. For example:<\/p>\n\n\n\n X and Y are uncorrelated when they have no relation, i.e. a change in one variable doesn’t affect the other. <\/p>\n\n\n\n One way to identify correlation is to look for visual cues in scatter plots. An increasing trend line indicates a positive correlation, while a decreasing trend line indicates a negative correlation.<\/p>\n\n\n\n While the above method works as a preliminary analysis, to get a concrete measure we use a correlation coefficient, which quantifies the exact degree of correlation. <\/p>\n\n\n\n It gives an estimate of the correlation between two variables. For continuous variables, we usually use Pearson’s correlation coefficient. 
It is the covariance of the two variables divided by the product of their standard deviations. Its value ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation); 0 indicates no correlation. <\/p>\n\n\n\n Given a pair of random variables \\((X,Y) \\), the Pearson correlation coefficient, denoted by \\(\\rho \\), is:<\/p>\n\n\n\n \\(\\rho = \\frac{cov(X,Y)}{\\sigma(X)\\sigma(Y)} \\) <\/p><\/blockquote>\n\n\n\n where:<\/p>\n\n\n\n \\(cov \\) is the covariance. The formula for covariance (omitting the \\(1\/n \\) normalizing factor, which cancels in the ratio below) is:<\/p>\n\n\n\n \\(cov(X,Y) = \\sum_{i=1}^{n}(x_{i} - \\overline{x})(y_{i} - \\overline{y}) \\)<\/p><\/blockquote>\n\n\n\n The standard deviation (again up to normalization) is given by <\/p>\n\n\n\n \\(\\sigma(X) = \\sqrt{\\sum_{i=1}^{n}(x_{i} - \\overline{x})^2} \\) which gives us the Pearson correlation coefficient as:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{\\sum_{i=1}^{n}(x_{i} - \\overline{x})(y_{i} - \\overline{y})}{\\sqrt{\\sum_{i=1}^{n}(x_{i} - \\overline{x})^2}\\sqrt{\\sum_{i=1}^{n}(y_{i} - \\overline{y})^2}} \\)<\/p><\/blockquote>\n\n\n\n where <\/p>\n\n\n\n \\(n \\) is the sample size. The formula can be rearranged into a computationally simpler form by expanding the means:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{n\\sum xy - \\sum x\\sum y}{\\sqrt{n\\sum x^2 - (\\sum x)^2} \\sqrt{n\\sum y^2 - (\\sum y)^2} } \\)<\/p>\n\n\n\n The value of the coefficient of correlation \u03c1 always ranges from -1 to +1, and it describes not only the magnitude of the correlation but also its direction. For example, \u03c1 = +0.8 indicates a positive correlation because the sign of \u03c1 is plus, and a high degree of correlation because the numerical value 0.8 is close to 1. 
If \u03c1=-0.4, it indicates a low degree of negative correlation, because the sign of \u03c1 is negative and its numerical value is less than 0.5.<\/p>\n\n\n\n Note:<\/strong><\/p>\n\n\n\n As you might have guessed from the use of the mean in the formula, the correlation coefficient is sensitive to outliers. So, in exploratory data analysis, it is important to remove any outliers from the dataset before computing the correlation.<\/p>\n\n\n\n Let’s try to find the correlation coefficient on a sample dataset<\/a>. A classic example is the correlation between students’ GPA and their classroom attendance. Looking at the scatter plot trendline, we can assume there is a positive correlation here, so let’s find out its magnitude.<\/p>\n\n\n\n The dataset contains the attendance and GPA of 75 students in a class, with the number of school days equal to 180. The full dataset is given below.<\/p>\n\n\n\n Substituting the above values in our correlation coefficient formula:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{n\\sum XY - \\sum X\\sum Y}{\\sqrt{n\\sum X^2 - (\\sum X)^2} \\sqrt{n\\sum Y^2 - (\\sum Y)^2} } \\) <\/p>\n\n\n\n we get:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{75 \\times 34646.7 - 219.89 \\times 11469}{\\sqrt{75 \\times 700.8697 - 219.89^2} \\sqrt{75 \\times 1780033.0 - 11469^2} } = 0.84 \\)<\/p>\n\n\n\n This indicates a strong positive correlation. <\/p>\n\n\n\n There are many standard Python libraries that can calculate correlation; I will use the well-known NumPy library. The code below shows the calculation for the above dataset using the formula as well as NumPy.<\/p>\n\n\n\n <\/p>\n\n\n\n When there are more than two variables and you want to understand how correlated they all are, we use a correlation matrix, which gives a single view of all pairwise correlations. A correlation matrix is simply a table showing the correlation coefficients among your variables. 
Each cell in the table shows the correlation between two variables.<\/p>\n\n\n\n The below matrix shows the correlation among different constituents of wine in our dataset. <\/p>\n\n\n\n From the correlation matrix above we can make the following observations:<\/p>\n\n\n\n Code for the above analysis<\/p>\n\n\n\n [1] Height and Weight datasource -http:\/\/www.math.utah.edu\/~korevaar\/2270fall09\/project2\/htwts09.pdf [1] A Simple Study on Weight and Height of Students https:\/\/www.hindawi.com\/journals\/tswj\/2017\/7258607\/<\/a> <\/p>\n","protected":false},"excerpt":{"rendered":" Correlation is the measure of how two or more variables are related to one another, also referred to as linear dependence. An increase in demand for a product increases its price, also called the demand curve, traffic on roads at certain intervals of time of the day, the amount of rain correlates with grass fires, […]<\/p>\n","protected":false},"author":1,"featured_media":1779,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24],"tags":[66,65,67],"class_list":["post-1750","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence","tag-correlation-matrix","tag-correlations","tag-pearsons-correlation-coefficient"],"_links":{"self":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/1750","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/comments?post=1750"}],"version-history":[{"count":26,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/1750\/revisions"}],"predecessor-version":[{"id":1796,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/
posts\/1750\/revisions\/1796"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media\/1779"}],"wp:attachment":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media?parent=1750"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/categories?post=1750"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/tags?post=1750"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}Positive and Negative Correlation<\/h2>\n\n\n\n
Positive Correlation<\/h3>\n\n\n\n
Negative Correlation<\/h3>\n\n\n\n
<\/figure>\n\n\n\n
No Correlation<\/h3>\n\n\n\n
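As a quick illustrative sketch (synthetic data, not part of the article's dataset), `np.corrcoef` reproduces all three cases:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

pos = 2 * x + rng.normal(scale=0.5, size=1000)   # rises with x -> positive correlation
neg = -3 * x + rng.normal(scale=0.5, size=1000)  # falls as x rises -> negative correlation
ind = rng.normal(size=1000)                      # generated independently of x -> no correlation

print(np.corrcoef(x, pos)[0, 1])  # close to +1
print(np.corrcoef(x, neg)[0, 1])  # close to -1
print(np.corrcoef(x, ind)[0, 1])  # close to 0
```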
Identifying Correlation<\/h2>\n\n\n\n
<\/figure>\n\n\n\n
Pearson’s Correlation Coefficient<\/h3>\n\n\n\n
\\(\\sigma(X) \\) is the standard deviation of X
\\(\\sigma(Y) \\) is the standard deviation of Y<\/p><\/blockquote>\n\n\n\n
\\(\\sigma(Y) = \\sqrt{\\sum_{i=1}^{n}(y_{i} - \\overline{y})^2} \\)<\/p><\/blockquote>\n\n\n\n<\/figure>\n\n\n\n
GPA (X)<\/th> Days Present (Y)<\/th> \\(X^2 \\)<\/th> \\(Y^2 \\)<\/th> \\(XY \\)<\/th><\/tr><\/thead> 4<\/td> 180<\/td> 16<\/td> 32400<\/td> 720<\/td><\/tr> 2.5<\/td> 150<\/td> 6.25<\/td> 22500<\/td> 375<\/td><\/tr> 4<\/td> 170<\/td> 16<\/td> 28900<\/td> 680<\/td><\/tr> 3.9<\/td> 180<\/td> 15.21<\/td> 32400<\/td> 702<\/td><\/tr> 3.75<\/td> 177<\/td> 14.0625<\/td> 31329<\/td> 663.75<\/td><\/tr> 3.8<\/td> 180<\/td> 14.44<\/td> 32400<\/td> 684<\/td><\/tr> 2.9<\/td> 140<\/td> 8.41<\/td> 19600<\/td> 406<\/td><\/tr> 3.1<\/td> 169<\/td> 9.61<\/td> 28561<\/td> 523.9<\/td><\/tr> 3.25<\/td> 168<\/td> 10.5625<\/td> 28224<\/td> 546<\/td><\/tr> 3.4<\/td> 152<\/td> 11.56<\/td> 23104<\/td> 516.8<\/td><\/tr> 3.3<\/td> 150<\/td> 10.89<\/td> 22500<\/td> 495<\/td><\/tr> 3.9<\/td> 170<\/td> 15.21<\/td> 28900<\/td> 663<\/td><\/tr> 1.35<\/td> 109<\/td> 1.8225<\/td> 11881<\/td> 147.15<\/td><\/tr> 4<\/td> 180<\/td> 16<\/td> 32400<\/td> 720<\/td><\/tr> 1<\/td> 108<\/td> 1<\/td> 11664<\/td> 108<\/td><\/tr> 3.85<\/td> 175<\/td> 14.8225<\/td> 30625<\/td> 673.75<\/td><\/tr> 2.98<\/td> 144<\/td> 8.8804<\/td> 20736<\/td> 429.12<\/td><\/tr> 2.75<\/td> 120<\/td> 7.5625<\/td> 14400<\/td> 330<\/td><\/tr> 2.75<\/td> 133<\/td> 7.5625<\/td> 17689<\/td> 365.75<\/td><\/tr> 3.6<\/td> 160<\/td> 12.96<\/td> 25600<\/td> 576<\/td><\/tr> 3.5<\/td> 160<\/td> 12.25<\/td> 25600<\/td> 560<\/td><\/tr> 3.5<\/td> 159<\/td> 12.25<\/td> 25281<\/td> 556.5<\/td><\/tr> 3.5<\/td> 165<\/td> 12.25<\/td> 27225<\/td> 577.5<\/td><\/tr> 3.85<\/td> 180<\/td> 14.8225<\/td> 32400<\/td> 693<\/td><\/tr> 2.95<\/td> 149<\/td> 8.7025<\/td> 22201<\/td> 439.55<\/td><\/tr> 3.95<\/td> 180<\/td> 15.6025<\/td> 32400<\/td> 711<\/td><\/tr> 3.65<\/td> 160<\/td> 13.3225<\/td> 25600<\/td> 584<\/td><\/tr> 3.55<\/td> 155<\/td> 12.6025<\/td> 24025<\/td> 550.25<\/td><\/tr> 3.58<\/td> 156<\/td> 12.8164<\/td> 24336<\/td> 558.48<\/td><\/tr> 2.98<\/td> 145<\/td> 8.8804<\/td> 21025<\/td> 432.1<\/td><\/tr> 1.5<\/td> 122<\/td> 2.25<\/td> 14884<\/td> 
183<\/td><\/tr> 1.75<\/td> 131<\/td> 3.0625<\/td> 17161<\/td> 229.25<\/td><\/tr> 2.2<\/td> 156<\/td> 4.84<\/td> 24336<\/td> 343.2<\/td><\/tr> 3<\/td> 166<\/td> 9<\/td> 27556<\/td> 498<\/td><\/tr> 3<\/td> 170<\/td> 9<\/td> 28900<\/td> 510<\/td><\/tr> 3<\/td> 155<\/td> 9<\/td> 24025<\/td> 465<\/td><\/tr> 3.15<\/td> 158<\/td> 9.9225<\/td> 24964<\/td> 497.7<\/td><\/tr> 3.9<\/td> 170<\/td> 15.21<\/td> 28900<\/td> 663<\/td><\/tr> 3.15<\/td> 160<\/td> 9.9225<\/td> 25600<\/td> 504<\/td><\/tr> 3.85<\/td> 165<\/td> 14.8225<\/td> 27225<\/td> 635.25<\/td><\/tr> 2.7<\/td> 159<\/td> 7.29<\/td> 25281<\/td> 429.3<\/td><\/tr> 1<\/td> 119<\/td> 1<\/td> 14161<\/td> 119<\/td><\/tr> 3.25<\/td> 168<\/td> 10.5625<\/td> 28224<\/td> 546<\/td><\/tr> 3.9<\/td> 175<\/td> 15.21<\/td> 30625<\/td> 682.5<\/td><\/tr> 2.8<\/td> 161<\/td> 7.84<\/td> 25921<\/td> 450.8<\/td><\/tr> 3.5<\/td> 160<\/td> 12.25<\/td> 25600<\/td> 560<\/td><\/tr> 3.4<\/td> 160<\/td> 11.56<\/td> 25600<\/td> 544<\/td><\/tr> 2.3<\/td> 150<\/td> 5.29<\/td> 22500<\/td> 345<\/td><\/tr> 2.5<\/td> 140<\/td> 6.25<\/td> 19600<\/td> 350<\/td><\/tr> 2.35<\/td> 148<\/td> 5.5225<\/td> 21904<\/td> 347.8<\/td><\/tr> 2.95<\/td> 149<\/td> 8.7025<\/td> 22201<\/td> 439.55<\/td><\/tr> 3.55<\/td> 160<\/td> 12.6025<\/td> 25600<\/td> 568<\/td><\/tr> 3.6<\/td> 155<\/td> 12.96<\/td> 24025<\/td> 558<\/td><\/tr> 3.3<\/td> 166<\/td> 10.89<\/td> 27556<\/td> 547.8<\/td><\/tr> 3.85<\/td> 160<\/td> 14.8225<\/td> 25600<\/td> 616<\/td><\/tr> 3.95<\/td> 179<\/td> 15.6025<\/td> 32041<\/td> 707.05<\/td><\/tr> 2.95<\/td> 145<\/td> 8.7025<\/td> 21025<\/td> 427.75<\/td><\/tr> 2<\/td> 143<\/td> 4<\/td> 20449<\/td> 286<\/td><\/tr> 2<\/td> 145<\/td> 4<\/td> 21025<\/td> 290<\/td><\/tr> 1.75<\/td> 140<\/td> 3.0625<\/td> 19600<\/td> 245<\/td><\/tr> 1.5<\/td> 122<\/td> 2.25<\/td> 14884<\/td> 183<\/td><\/tr> 1.5<\/td> 125<\/td> 2.25<\/td> 15625<\/td> 187.5<\/td><\/tr> 1<\/td> 110<\/td> 1<\/td> 12100<\/td> 110<\/td><\/tr> 1.95<\/td> 120<\/td> 3.8025<\/td> 14400<\/td> 
234<\/td><\/tr> 1.8<\/td> 165<\/td> 3.24<\/td> 27225<\/td> 297<\/td><\/tr> 2<\/td> 120<\/td> 4<\/td> 14400<\/td> 240<\/td><\/tr> 3.25<\/td> 171<\/td> 10.5625<\/td> 29241<\/td> 555.75<\/td><\/tr> 3.9<\/td> 160<\/td> 15.21<\/td> 25600<\/td> 624<\/td><\/tr> 2.15<\/td> 144<\/td> 4.6225<\/td> 20736<\/td> 309.6<\/td><\/tr> 2.5<\/td> 150<\/td> 6.25<\/td> 22500<\/td> 375<\/td><\/tr> 1.95<\/td> 149<\/td> 3.8025<\/td> 22201<\/td> 290.55<\/td><\/tr> 1<\/td> 120<\/td> 1<\/td> 14400<\/td> 120<\/td><\/tr> 3.95<\/td> 150<\/td> 15.6025<\/td> 22500<\/td> 592.5<\/td><\/tr> 2.75<\/td> 149<\/td> 7.5625<\/td> 22201<\/td> 409.75<\/td><\/tr> 3.5<\/td> 155<\/td> 12.25<\/td> 24025<\/td> 542.5<\/td><\/tr><\/tbody> \\(\\sum X = \\) 219.89<\/td> \\(\\sum Y = \\)11469.0<\/td> \\(\\sum X^2 = \\) 700.8697<\/td> \\(\\sum Y^2 = \\)1780033.0<\/td> \\(\\sum XY = \\)34646.7<\/td><\/tr><\/tfoot><\/table><\/figure>\n\n\n\n Correlation using python<\/h2>\n\n\n\n
import numpy as np\nimport math\nimport seaborn as sn\nimport matplotlib.pyplot as plt\n\n# set seaborn as the default chart style\nsn.set()\n\n# dataset: [GPA, days present] per student\ngpa_days = np.array([[4,180],[2.5,150],[4,170],[3.9,180],[3.75,177],[3.8,180],[2.9,140],[3.1,169],[3.25,168],[3.4,152],[3.3,150],[3.9,170],[1.35,109],[4,180],[1,108],[3.85,175],[2.98,144],[2.75,120],[2.75,133],[3.6,160],[3.5,160],[3.5,159],[3.5,165],[3.85,180],[2.95,149],[3.95,180],[3.65,160],[3.55,155],[3.58,156],[2.98,145],[1.5,122],[1.75,131],[2.2,156],[3,166],[3,170],[3,155],[3.15,158],[3.9,170],[3.15,160],[3.85,165],[2.7,159],[1,119],[3.25,168],[3.9,175],[2.8,161],[3.5,160],[3.4,160],[2.3,150],[2.5,140],[2.35,148],[2.95,149],[3.55,160],[3.6,155],[3.3,166],[3.85,160],[3.95,179],[2.95,145],[2,143],[2,145],[1.75,140],[1.5,122],[1.5,125],[1,110],[1.95,120],[1.8,165],[2,120],[3.25,171],[3.9,160],[2.15,144],[2.5,150],[1.95,149],[1,120],[3.95,150],[2.75,149],[3.5,155]])\n\n# finding correlation using Pearson's correlation formula\ntotal = len(gpa_days)\nsum_x = np.sum(gpa_days[:,0])\nsum_y = np.sum(gpa_days[:,1])\nsum_xx = np.sum(gpa_days[:,0]**2)\nsum_yy = np.sum(gpa_days[:,1]**2)\nsum_xy = np.sum(gpa_days[:,1]*gpa_days[:,0])\n\ncorrelation_p = (total*sum_xy - sum_x*sum_y)\/(math.sqrt(total*sum_xx - sum_x**2) * math.sqrt(total*sum_yy - sum_y**2))\n\nprint(\"correlation using formula:\", correlation_p)\n\nxy = [gpa_days[:,0], gpa_days[:,1]]\n\n# correlation using the NumPy standard library,\n# which internally uses Pearson's correlation\ncorrelation_matrix = np.corrcoef(xy)\nprint(\"correlation using numpy:\", correlation_matrix[0][1])\n\nfig, ax = plt.subplots(ncols=2, figsize=(15,5))\n\nsn.heatmap(np.corrcoef(xy), annot=True, ax=ax[1])\nax[1].set_title(\"correlation matrix\")\n\nsn.scatterplot(x=gpa_days[:,0], y=gpa_days[:,1], ax=ax[0])\nax[0].set_xlabel(\"GPA\")\nax[0].set_ylabel(\"Number of days attended\")\nax[0].set_title(\"Attendance vs GPA dataset\")<\/code><\/pre>\n\n\n\n
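Since the article stresses that causation needs proper hypothesis testing, it is worth noting that SciPy's `pearsonr` returns a p-value alongside the coefficient. A minimal sketch on a small illustrative subset of the GPA/attendance pairs from the table (the exact r will differ from the full-dataset 0.84):

```python
import numpy as np
from scipy import stats

# a small subset of the GPA / days-present pairs from the table above
gpa  = np.array([4, 2.5, 4, 3.9, 3.75, 3.8, 2.9, 3.1, 1.35, 1])
days = np.array([180, 150, 170, 180, 177, 180, 140, 169, 109, 108])

# r is the Pearson coefficient; p is the two-sided p-value for
# the null hypothesis that the true correlation is zero
r, p = stats.pearsonr(gpa, days)
print(f"r = {r:.2f}, p-value = {p:.4f}")
```

A small p-value only says the linear association is unlikely to be chance; it still says nothing about which variable, if either, drives the other.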
Correlation Matrix<\/h2>\n\n\n\n
import pandas as pd\nimport seaborn as sn\nimport matplotlib.pyplot as plt\n\n# exploring the wine quality dataset\ndf = pd.read_csv('https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/wine-quality\/winequality-white.csv', sep=';')\nprint(df.head())\nplt.figure(figsize=(20,20))\nplt.title(\"Correlation Matrix Wine Dataset\")\nsn.heatmap(df.corr(), annot=True, cmap=\"YlGnBu\")<\/code><\/pre>\n\n\n\n
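Rather than scanning the heatmap by eye, the strongest pairwise correlations can be ranked directly from `df.corr()`. A sketch using a tiny made-up stand-in frame (in the article, `df` comes from the wine-quality CSV above; the column names and numbers here are purely illustrative):

```python
import numpy as np
import pandas as pd

# toy stand-in for the wine dataframe, with made-up values
df = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1, 7.2, 6.2],
    "density":       [1.001, 0.994, 0.995, 0.996, 0.994],
    "alcohol":       [8.8, 9.5, 10.1, 9.9, 12.8],
})

corr = df.corr()
# keep each unordered pair once: mask the diagonal and upper triangle,
# then flatten and sort by absolute correlation
mask = np.triu(np.ones(corr.shape, dtype=bool))
pairs = corr.mask(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

The result is a Series indexed by variable pairs, with the most strongly correlated pair first.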
Key Ideas<\/h2>\n\n\n\n
Data Sources:<\/h2>\n\n\n\n
[2] Wine Dataset – https:\/\/archive.ics.uci.edu\/ml\/datasets\/wine<\/p>\n\n\n\nReferences:<\/h2>\n\n\n\n
[2] https:\/\/blogs.worldbank.org\/health\/female-education-and-childbearing-closer-look-data<\/a>
[3] https:\/\/wol.iza.org\/uploads\/articles\/228\/pdfs\/female-education-and-its-impact-on-fertility.pdf<\/a>
[4] Becker, G S and G H Lewis (1973), \u201cOn the interaction between the quantity and quality of children\u201d, Journal of Political Economy 81: S279\u2013S288.
[5] Galor, O and D N Weil (1996), \u201cThe gender gap, fertility, and growth\u201d, American Economic Review 86(3): 374\u2013387.
[6] Galor, O and D N Weil (2000), \u201cPopulation, technology, and growth: From Malthusian stagnation to the demographic transition and beyond\u201d, American Economic Review 90(4): 806\u2013828.
[7] https:\/\/en.wikipedia.org\/wiki\/Pearson_correlation_coefficient<\/p>\n\n\n\n