{"id":1750,"date":"2021-05-07T02:11:57","date_gmt":"2021-05-07T02:11:57","guid":{"rendered":"http:\/\/192.168.31.181\/muthu\/?p=1750"},"modified":"2021-05-09T02:50:43","modified_gmt":"2021-05-09T02:50:43","slug":"understanding-correlations-and-correlation-matrix","status":"publish","type":"post","link":"http:\/\/write.muthu.co\/understanding-correlations-and-correlation-matrix\/","title":{"rendered":"Understanding Correlations and Correlation Matrix"},"content":{"rendered":"\n
Correlation is a measure of how two or more variables are related to one another, also referred to as linear dependence. Examples are everywhere: an increase in demand for a product drives up its price (the demand curve), road traffic peaks at certain times of day, and the amount of rainfall correlates with grass fires. <\/p>\n\n\n\n
Correlation doesn’t imply causation<\/a>: even when two variables show a linear dependence, one should not assume that one is affecting the other without proper hypothesis testing. Correlation gives you an exploratory overview of any dependence between the variables in your dataset; causation can only be established after careful study. For example, it is a general observation that women who are more educated tend to have fewer children, while women who are less educated tend to have more. If you compare the populations of developed and under-developed countries against their national education indices, the two seem correlated, but we can’t conclude that education itself causes women to have fewer babies. So, correlation is best used as a suggestion rather than a technique that gives definitive answers. It is often a preparatory piece of analysis that gives some clues to what the data might yield, to be followed by other techniques such as regression. <\/p>\n\n\n\n Two variables X and Y are positively correlated if high values of X go with high values of Y and low values of X go with low values of Y. For example:<\/p>\n\n\n\n Two variables are said to be negatively correlated if high values of X go with low values of Y and vice versa. For example:<\/p>\n\n\n\n X and Y are uncorrelated when they have no relation, i.e. a change in one variable doesn’t affect the other. <\/p>\n\n\n\n One way to identify correlation is to look for visual cues in scatter plots. An increasing trend line indicates a positive correlation, while a decreasing trend line indicates a negative correlation.<\/p>\n\n\n\n While the above method works as a preliminary analysis, to get a concrete measure we use a correlation coefficient, which quantifies the exact degree of correlation. <\/p>\n\n\n\n It gives an estimate of the correlation between two variables. For continuous variables, we usually use Pearson’s correlation coefficient. 
It is the covariance of the two variables divided by the product of their standard deviations. Its value ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation); 0 indicates no correlation. <\/p>\n\n\n\n Given a pair of random variables \\((X,Y) \\), the Pearson correlation coefficient, denoted by \\(\\rho \\), is:<\/p>\n\n\n\n \\(\\rho = \\frac{cov(X,Y)}{\\sigma(X)\\sigma(Y)} \\) <\/p><\/blockquote>\n\n\n\n where:<\/p>\n\n\n\n \\(cov \\) is the covariance. The formula for covariance (omitting the \\(1\/n \\) normalizing factor, which cancels in the ratio below) is:<\/p>\n\n\n\n \\(cov(X,Y) = \\sum_{i=1}^{n}(x_{i} - \\overline{x})(y_{i} - \\overline{y}) \\)<\/p><\/blockquote>\n\n\n\n The standard deviation (again up to normalization) is given by <\/p>\n\n\n\n \\(\\sigma(X) = \\sqrt{\\sum_{i=1}^{n}(x_{i} - \\overline{x})^2} \\) which gives us the Pearson correlation coefficient as:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{\\sum_{i=1}^{n}(x_{i} - \\overline{x})(y_{i} - \\overline{y})}{\\sqrt{\\sum_{i=1}^{n}(x_{i} - \\overline{x})^2}\\sqrt{\\sum_{i=1}^{n}(y_{i} - \\overline{y})^2}} \\)<\/p><\/blockquote>\n\n\n\n where <\/p>\n\n\n\n \\(n \\) is the sample size. The formula can be rearranged into a computationally simpler form by expanding the means:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{n\\sum xy - \\sum x\\sum y}{\\sqrt{n\\sum x^2 - (\\sum x)^2} \\sqrt{n\\sum y^2 - (\\sum y)^2} } \\)<\/p>\n\n\n\n The value of the coefficient of correlation \u03c1 always ranges from -1 to +1, and it describes not only the magnitude of the correlation but also its direction. For example, \u03c1 = +0.8 indicates a positive correlation because the sign of \u03c1 is plus, and a high degree of correlation because the numerical value 0.8 is close to 1. 
If \u03c1=-0.4, it indicates a low degree of negative correlation, because the sign of \u03c1 is negative and its numerical value is less than 0.5.<\/p>\n\n\n\n Note:<\/strong><\/p>\n\n\n\n As you might have guessed from the use of the mean in the formula, the correlation coefficient is sensitive to outliers. So, in exploratory data analysis, it is important to remove any outliers from the dataset before computing the correlation.<\/p>\n\n\n\n Let’s try to find the correlation coefficient on a sample dataset<\/a>. A classic example is the correlation between students’ GPA and their classroom attendance. Looking at the scatter plot trendline, we can assume there is a positive correlation here, so let’s find out its magnitude.<\/p>\n\n\n\n The dataset contains the attendance and GPA of 75 students in a class, with the number of school days equal to 180. The full dataset is given below.<\/p>\n\n\n\n Substituting the above values in our correlation coefficient formula:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{n\\sum XY - \\sum X\\sum Y}{\\sqrt{n\\sum X^2 - (\\sum X)^2} \\sqrt{n\\sum Y^2 - (\\sum Y)^2} } \\) <\/p>\n\n\n\n we get:<\/p>\n\n\n\n \\(\\rho(X,Y) = \\frac{75 \\times 34646.7 - 219.89 \\times 11469}{\\sqrt{75 \\times 700.8697 - 219.89^2} \\sqrt{75 \\times 1780033.0 - 11469^2} } = 0.84 \\)<\/p>\n\n\n\n This indicates a strong positive correlation. <\/p>\n\n\n\n There are many standard Python libraries that can calculate correlation; I will use the well-known NumPy library. The code below shows the calculation for the above dataset using the formula as well as NumPy.<\/p>\n\n\n\n <\/p>\n\n\n\n When there are more than two variables and you want to understand how correlated they all are, we use a correlation matrix, which gives a single view of all pairwise correlations. A correlation matrix is simply a table showing the correlation coefficients among your variables. 
Each cell in the table shows the correlation between two variables.<\/p>\n\n\n\n The below matrix shows the correlation among different constituents of wine in our dataset. <\/p>\n\n\n\n From the correlation matrix above we can make the following observations:<\/p>\n\n\n\n Code for the above analysis<\/p>\n\n\n\n [1] Height and Weight datasource -http:\/\/www.math.utah.edu\/~korevaar\/2270fall09\/project2\/htwts09.pdf [1] A Simple Study on Weight and Height of Students https:\/\/www.hindawi.com\/journals\/tswj\/2017\/7258607\/<\/a> <\/p>\n","protected":false},"excerpt":{"rendered":" Correlation is the measure of how two or more variables are related to one another, also referred to as linear dependence. An increase in demand for a product increases its price, also called the demand curve, traffic on roads at certain intervals of time of the day, the amount of rain correlates with grass fires, […]<\/p>\n","protected":false},"author":1,"featured_media":1779,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24],"tags":[66,65,67],"class_list":["post-1750","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-artificial-intelligence","tag-correlation-matrix","tag-correlations","tag-pearsons-correlation-coefficient"],"_links":{"self":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/1750","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/comments?post=1750"}],"version-history":[{"count":26,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/1750\/revisions"}],"predecessor-version":[{"id":1796,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/
posts\/1750\/revisions\/1796"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media\/1779"}],"wp:attachment":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media?parent=1750"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/categories?post=1750"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/tags?post=1750"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}Positive and Negative Correlation<\/h2>\n\n\n\n
Positive Correlation<\/h3>\n\n\n\n
Negative Correlation<\/h3>\n\n\n\n
<\/figure>\n\n\n\n
No Correlation<\/h3>\n\n\n\n
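As a quick illustrative sketch (synthetic data, not part of the article's dataset), `np.corrcoef` reproduces all three cases:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

pos = 2 * x + rng.normal(scale=0.5, size=1000)   # rises with x -> positive correlation
neg = -3 * x + rng.normal(scale=0.5, size=1000)  # falls as x rises -> negative correlation
ind = rng.normal(size=1000)                      # generated independently of x -> no correlation

print(np.corrcoef(x, pos)[0, 1])  # close to +1
print(np.corrcoef(x, neg)[0, 1])  # close to -1
print(np.corrcoef(x, ind)[0, 1])  # close to 0
```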
Identifying Correlation<\/h2>\n\n\n\n
<\/figure>\n\n\n\n
Pearson’s Correlation Coefficient<\/h3>\n\n\n\n
\\(\\sigma(X) \\) is the standard deviation of X
\\(\\sigma(Y) \\) is the standard deviation of Y<\/p><\/blockquote>\n\n\n\n
\\(\\sigma(Y) = \\sqrt{\\sum_{i=1}^{n}(y_{i} - \\overline{y})^2} \\)<\/p><\/blockquote>\n\n\n\n<\/figure>\n\n\n\n
GPA (X)<\/th> Days Present (Y)<\/th> \\(X^2 \\)<\/th> \\(Y^2 \\)<\/th> \\(XY \\)<\/th><\/tr><\/thead> 4<\/td> 180<\/td> 16<\/td> 32400<\/td> 720<\/td><\/tr> 2.5<\/td> 150<\/td> 6.25<\/td> 22500<\/td> 375<\/td><\/tr> 4<\/td> 170<\/td> 16<\/td> 28900<\/td> 680<\/td><\/tr> 3.9<\/td> 180<\/td> 15.21<\/td> 32400<\/td> 702<\/td><\/tr> 3.75<\/td> 177<\/td> 14.0625<\/td> 31329<\/td> 663.75<\/td><\/tr> 3.8<\/td> 180<\/td> 14.44<\/td> 32400<\/td> 684<\/td><\/tr> 2.9<\/td> 140<\/td> 8.41<\/td> 19600<\/td> 406<\/td><\/tr> 3.1<\/td> 169<\/td> 9.61<\/td> 28561<\/td> 523.9<\/td><\/tr> 3.25<\/td> 168<\/td> 10.5625<\/td> 28224<\/td> 546<\/td><\/tr> 3.4<\/td> 152<\/td> 11.56<\/td> 23104<\/td> 516.8<\/td><\/tr> 3.3<\/td> 150<\/td> 10.89<\/td> 22500<\/td> 495<\/td><\/tr> 3.9<\/td> 170<\/td> 15.21<\/td> 28900<\/td> 663<\/td><\/tr> 1.35<\/td> 109<\/td> 1.8225<\/td> 11881<\/td> 147.15<\/td><\/tr> 4<\/td> 180<\/td> 16<\/td> 32400<\/td> 720<\/td><\/tr> 1<\/td> 108<\/td> 1<\/td> 11664<\/td> 108<\/td><\/tr> 3.85<\/td> 175<\/td> 14.8225<\/td> 30625<\/td> 673.75<\/td><\/tr> 2.98<\/td> 144<\/td> 8.8804<\/td> 20736<\/td> 429.12<\/td><\/tr> 2.75<\/td> 120<\/td> 7.5625<\/td> 14400<\/td> 330<\/td><\/tr> 2.75<\/td> 133<\/td> 7.5625<\/td> 17689<\/td> 365.75<\/td><\/tr> 3.6<\/td> 160<\/td> 12.96<\/td> 25600<\/td> 576<\/td><\/tr> 3.5<\/td> 160<\/td> 12.25<\/td> 25600<\/td> 560<\/td><\/tr> 3.5<\/td> 159<\/td> 12.25<\/td> 25281<\/td> 556.5<\/td><\/tr> 3.5<\/td> 165<\/td> 12.25<\/td> 27225<\/td> 577.5<\/td><\/tr> 3.85<\/td> 180<\/td> 14.8225<\/td> 32400<\/td> 693<\/td><\/tr> 2.95<\/td> 149<\/td> 8.7025<\/td> 22201<\/td> 439.55<\/td><\/tr> 3.95<\/td> 180<\/td> 15.6025<\/td> 32400<\/td> 711<\/td><\/tr> 3.65<\/td> 160<\/td> 13.3225<\/td> 25600<\/td> 584<\/td><\/tr> 3.55<\/td> 155<\/td> 12.6025<\/td> 24025<\/td> 550.25<\/td><\/tr> 3.58<\/td> 156<\/td> 12.8164<\/td> 24336<\/td> 558.48<\/td><\/tr> 2.98<\/td> 145<\/td> 8.8804<\/td> 21025<\/td> 432.1<\/td><\/tr> 1.5<\/td> 122<\/td> 2.25<\/td> 14884<\/td> 
183<\/td><\/tr> 1.75<\/td> 131<\/td> 3.0625<\/td> 17161<\/td> 229.25<\/td><\/tr> 2.2<\/td> 156<\/td> 4.84<\/td> 24336<\/td> 343.2<\/td><\/tr> 3<\/td> 166<\/td> 9<\/td> 27556<\/td> 498<\/td><\/tr> 3<\/td> 170<\/td> 9<\/td> 28900<\/td> 510<\/td><\/tr> 3<\/td> 155<\/td> 9<\/td> 24025<\/td> 465<\/td><\/tr> 3.15<\/td> 158<\/td> 9.9225<\/td> 24964<\/td> 497.7<\/td><\/tr> 3.9<\/td> 170<\/td> 15.21<\/td> 28900<\/td> 663<\/td><\/tr> 3.15<\/td> 160<\/td> 9.9225<\/td> 25600<\/td> 504<\/td><\/tr> 3.85<\/td> 165<\/td> 14.8225<\/td> 27225<\/td> 635.25<\/td><\/tr> 2.7<\/td> 159<\/td> 7.29<\/td> 25281<\/td> 429.3<\/td><\/tr> 1<\/td> 119<\/td> 1<\/td> 14161<\/td> 119<\/td><\/tr> 3.25<\/td> 168<\/td> 10.5625<\/td> 28224<\/td> 546<\/td><\/tr> 3.9<\/td> 175<\/td> 15.21<\/td> 30625<\/td> 682.5<\/td><\/tr> 2.8<\/td> 161<\/td> 7.84<\/td> 25921<\/td> 450.8<\/td><\/tr> 3.5<\/td> 160<\/td> 12.25<\/td> 25600<\/td> 560<\/td><\/tr> 3.4<\/td> 160<\/td> 11.56<\/td> 25600<\/td> 544<\/td><\/tr> 2.3<\/td> 150<\/td> 5.29<\/td> 22500<\/td> 345<\/td><\/tr> 2.5<\/td> 140<\/td> 6.25<\/td> 19600<\/td> 350<\/td><\/tr> 2.35<\/td> 148<\/td> 5.5225<\/td> 21904<\/td> 347.8<\/td><\/tr> 2.95<\/td> 149<\/td> 8.7025<\/td> 22201<\/td> 439.55<\/td><\/tr> 3.55<\/td> 160<\/td> 12.6025<\/td> 25600<\/td> 568<\/td><\/tr> 3.6<\/td> 155<\/td> 12.96<\/td> 24025<\/td> 558<\/td><\/tr> 3.3<\/td> 166<\/td> 10.89<\/td> 27556<\/td> 547.8<\/td><\/tr> 3.85<\/td> 160<\/td> 14.8225<\/td> 25600<\/td> 616<\/td><\/tr> 3.95<\/td> 179<\/td> 15.6025<\/td> 32041<\/td> 707.05<\/td><\/tr> 2.95<\/td> 145<\/td> 8.7025<\/td> 21025<\/td> 427.75<\/td><\/tr> 2<\/td> 143<\/td> 4<\/td> 20449<\/td> 286<\/td><\/tr> 2<\/td> 145<\/td> 4<\/td> 21025<\/td> 290<\/td><\/tr> 1.75<\/td> 140<\/td> 3.0625<\/td> 19600<\/td> 245<\/td><\/tr> 1.5<\/td> 122<\/td> 2.25<\/td> 14884<\/td> 183<\/td><\/tr> 1.5<\/td> 125<\/td> 2.25<\/td> 15625<\/td> 187.5<\/td><\/tr> 1<\/td> 110<\/td> 1<\/td> 12100<\/td> 110<\/td><\/tr> 1.95<\/td> 120<\/td> 3.8025<\/td> 14400<\/td> 
234<\/td><\/tr> 1.8<\/td> 165<\/td> 3.24<\/td> 27225<\/td> 297<\/td><\/tr> 2<\/td> 120<\/td> 4<\/td> 14400<\/td> 240<\/td><\/tr> 3.25<\/td> 171<\/td> 10.5625<\/td> 29241<\/td> 555.75<\/td><\/tr> 3.9<\/td> 160<\/td> 15.21<\/td> 25600<\/td> 624<\/td><\/tr> 2.15<\/td> 144<\/td> 4.6225<\/td> 20736<\/td> 309.6<\/td><\/tr> 2.5<\/td> 150<\/td> 6.25<\/td> 22500<\/td> 375<\/td><\/tr> 1.95<\/td> 149<\/td> 3.8025<\/td> 22201<\/td> 290.55<\/td><\/tr> 1<\/td> 120<\/td> 1<\/td> 14400<\/td> 120<\/td><\/tr> 3.95<\/td> 150<\/td> 15.6025<\/td> 22500<\/td> 592.5<\/td><\/tr> 2.75<\/td> 149<\/td> 7.5625<\/td> 22201<\/td> 409.75<\/td><\/tr> 3.5<\/td> 155<\/td> 12.25<\/td> 24025<\/td> 542.5<\/td><\/tr><\/tbody> \\(\\sum X = \\) 219.89<\/td> \\(\\sum Y = \\)11469.0<\/td> \\(\\sum X^2 = \\) 700.8697<\/td> \\(\\sum Y^2 = \\)1780033.0<\/td> \\(\\sum XY = \\)34646.7<\/td><\/tr><\/tfoot><\/table><\/figure>\n\n\n\n Correlation using python<\/h2>\n\n\n\n
import numpy as np\nimport math\nimport seaborn as sn\nimport matplotlib.pyplot as plt\n\n# set seaborn as the default chart style\nsn.set()\n\n# dataset: [GPA, days present] per student\ngpa_days = np.array([[4,180],[2.5,150],[4,170],[3.9,180],[3.75,177],[3.8,180],[2.9,140],[3.1,169],[3.25,168],[3.4,152],[3.3,150],[3.9,170],[1.35,109],[4,180],[1,108],[3.85,175],[2.98,144],[2.75,120],[2.75,133],[3.6,160],[3.5,160],[3.5,159],[3.5,165],[3.85,180],[2.95,149],[3.95,180],[3.65,160],[3.55,155],[3.58,156],[2.98,145],[1.5,122],[1.75,131],[2.2,156],[3,166],[3,170],[3,155],[3.15,158],[3.9,170],[3.15,160],[3.85,165],[2.7,159],[1,119],[3.25,168],[3.9,175],[2.8,161],[3.5,160],[3.4,160],[2.3,150],[2.5,140],[2.35,148],[2.95,149],[3.55,160],[3.6,155],[3.3,166],[3.85,160],[3.95,179],[2.95,145],[2,143],[2,145],[1.75,140],[1.5,122],[1.5,125],[1,110],[1.95,120],[1.8,165],[2,120],[3.25,171],[3.9,160],[2.15,144],[2.5,150],[1.95,149],[1,120],[3.95,150],[2.75,149],[3.5,155]])\n\n# finding correlation using Pearson's correlation formula\ntotal = len(gpa_days)\nsum_x = np.sum(gpa_days[:,0])\nsum_y = np.sum(gpa_days[:,1])\nsum_xx = np.sum(gpa_days[:,0]**2)\nsum_yy = np.sum(gpa_days[:,1]**2)\nsum_xy = np.sum(gpa_days[:,1]*gpa_days[:,0])\n\ncorrelation_p = (total*sum_xy - sum_x*sum_y)\/(math.sqrt(total*sum_xx - sum_x**2) * math.sqrt(total*sum_yy - sum_y**2))\n\nprint(\"correlation using formula:\", correlation_p)\n\nxy = [gpa_days[:,0], gpa_days[:,1]]\n\n# correlation using the NumPy standard library,\n# which internally uses Pearson's correlation\ncorrelation_matrix = np.corrcoef(xy)\nprint(\"correlation using numpy:\", correlation_matrix[0][1])\n\nfig, ax = plt.subplots(ncols=2, figsize=(15,5))\n\nsn.heatmap(np.corrcoef(xy), annot=True, ax=ax[1])\nax[1].set_title(\"correlation matrix\")\n\nsn.scatterplot(x=gpa_days[:,0], y=gpa_days[:,1], ax=ax[0])\nax[0].set_xlabel(\"GPA\")\nax[0].set_ylabel(\"Number of days attended\")\nax[0].set_title(\"Attendance vs GPA dataset\")<\/code><\/pre>\n\n\n\n
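Since the article stresses that causation needs proper hypothesis testing, it is worth noting that SciPy's `pearsonr` returns a p-value alongside the coefficient. A minimal sketch on a small illustrative subset of the GPA/attendance pairs from the table (the exact r will differ from the full-dataset 0.84):

```python
import numpy as np
from scipy import stats

# a small subset of the GPA / days-present pairs from the table above
gpa  = np.array([4, 2.5, 4, 3.9, 3.75, 3.8, 2.9, 3.1, 1.35, 1])
days = np.array([180, 150, 170, 180, 177, 180, 140, 169, 109, 108])

# r is the Pearson coefficient; p is the two-sided p-value for
# the null hypothesis that the true correlation is zero
r, p = stats.pearsonr(gpa, days)
print(f"r = {r:.2f}, p-value = {p:.4f}")
```

A small p-value only says the linear association is unlikely to be chance; it still says nothing about which variable, if either, drives the other.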
Correlation Matrix<\/h2>\n\n\n\n
import pandas as pd\nimport seaborn as sn\nimport matplotlib.pyplot as plt\n\n# exploring the wine quality dataset\ndf = pd.read_csv('https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/wine-quality\/winequality-white.csv', sep=';')\nprint(df.head())\nplt.figure(figsize=(20,20))\nplt.title(\"Correlation Matrix Wine Dataset\")\nsn.heatmap(df.corr(), annot=True, cmap=\"YlGnBu\")<\/code><\/pre>\n\n\n\n
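Rather than scanning the heatmap by eye, the strongest pairwise correlations can be ranked directly from `df.corr()`. A sketch using a tiny made-up stand-in frame (in the article, `df` comes from the wine-quality CSV above; the column names and numbers here are purely illustrative):

```python
import numpy as np
import pandas as pd

# toy stand-in for the wine dataframe, with made-up values
df = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1, 7.2, 6.2],
    "density":       [1.001, 0.994, 0.995, 0.996, 0.994],
    "alcohol":       [8.8, 9.5, 10.1, 9.9, 12.8],
})

corr = df.corr()
# keep each unordered pair once: mask the diagonal and upper triangle,
# then flatten and sort by absolute correlation
mask = np.triu(np.ones(corr.shape, dtype=bool))
pairs = corr.mask(mask).stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

The result is a Series indexed by variable pairs, with the most strongly correlated pair first.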
Key Ideas<\/h2>\n\n\n\n
Data Sources:<\/h2>\n\n\n\n
[2] Wine Dataset – https:\/\/archive.ics.uci.edu\/ml\/datasets\/wine<\/p>\n\n\n\nReferences:<\/h2>\n\n\n\n
[2] https:\/\/blogs.worldbank.org\/health\/female-education-and-childbearing-closer-look-data<\/a>
[3] https:\/\/wol.iza.org\/uploads\/articles\/228\/pdfs\/female-education-and-its-impact-on-fertility.pdf<\/a>
[4] Becker, G S and G H Lewis (1973), \u201cOn the interaction between the quantity and quality of children\u201d, Journal of Political Economy 81: S279\u2013S288.
[5] Galor, O and D N Weil (1996), \u201cThe gender gap, fertility, and growth\u201d, American Economic Review 86(3): 374\u2013387.
[6] Galor, O and D N Weil (2000), \u201cPopulation, technology, and growth: From Malthusian stagnation to the demographic transition and beyond\u201d, American Economic Review 90(4): 806\u2013828.
[7] https:\/\/en.wikipedia.org\/wiki\/Pearson_correlation_coefficient<\/p>\n\n\n\n