\\(\\sum X = \\) 219.89<\/td> \\(\\sum Y = \\)11469.0<\/td> \\(\\sum X^2 = \\) 700.8697<\/td> \\(\\sum Y^2 = \\)1780033.0<\/td> \\(\\sum XY = \\)34646.7<\/td><\/tr><\/tfoot><\/table><\/figure>\n\n\n\nSubstituting the above values in our correlation coefficient formula:<\/p>\n\n\n\n
\\(\\rho(X,Y) = \\frac{n\\sum XY - \\sum X\\sum Y}{\\sqrt{n\\sum X^2 - (\\sum X)^2} \\sqrt{n\\sum Y^2 - (\\sum Y)^2} } \\) <\/p>\n\n\n\n
we get:<\/p>\n\n\n\n
\\(\\rho(X,Y) = \\frac{75 \\times 34646.7 - 219.89 \\times 11469.0}{\\sqrt{75 \\times 700.8697 - 219.89^2} \\sqrt{75 \\times 1780033.0 - 11469.0^2} } \\approx 0.84 \\)<\/p>\n\n\n\n
The value 0.84 is close to +1, indicating a strong positive correlation between GPA and the number of days attended. <\/p>\n\n\n\n
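As a quick arithmetic check, the substitution can be reproduced from the summary sums alone. This is a minimal sketch: the sums are copied from the table footer, n = 75 is taken from the substituted formula, and the variable names are chosen here for illustration.

```python
import math

# Summary statistics copied from the table footer (X = GPA, Y = days attended)
n = 75
sum_x, sum_y = 219.89, 11469.0
sum_xx, sum_yy = 700.8697, 1780033.0
sum_xy = 34646.7

# Numerator of the formula: n * sum(XY) - sum(X) * sum(Y)
numerator = n * sum_xy - sum_x * sum_y

# The two square-root terms of the denominator
spread_x = math.sqrt(n * sum_xx - sum_x ** 2)
spread_y = math.sqrt(n * sum_yy - sum_y ** 2)

rho = numerator / (spread_x * spread_y)

print(f"numerator = {numerator:.2f}")
print(f"rho = {rho:.2f}")
```

The result rounds to 0.84, matching the hand calculation.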
Correlation using Python<\/h2>\n\n\n\n Many standard Python libraries can calculate correlation; I will use the well-known NumPy library. The code below computes the correlation for the above dataset twice: once with the formula and once with NumPy.<\/p>\n\n\n\n
import numpy as np\nimport math\nimport seaborn as sn\nimport matplotlib.pyplot as plt\n\n# use seaborn's default chart style\nsn.set()\n\n# dataset: each row is [GPA, number of days attended]\ngpa_days = np.array([[4,180],[2.5,150],[4,170],[3.9,180],[3.75,177],[3.8,180],[2.9,140],[3.1,169],[3.25,168],[3.4,152],[3.3,150],[3.9,170],[1.35,109],[4,180],[1,108],[3.85,175],[2.98,144],[2.75,120],[2.75,133],[3.6,160],[3.5,160],[3.5,159],[3.5,165],[3.85,180],[2.95,149],[3.95,180],[3.65,160],[3.55,155],[3.58,156],[2.98,145],[1.5,122],[1.75,131],[2.2,156],[3,166],[3,170],[3,155],[3.15,158],[3.9,170],[3.15,160],[3.85,165],[2.7,159],[1,119],[3.25,168],[3.9,175],[2.8,161],[3.5,160],[3.4,160],[2.3,150],[2.5,140],[2.35,148],[2.95,149],[3.55,160],[3.6,155],[3.3,166],[3.85,160],[3.95,179],[2.95,145],[2,143],[2,145],[1.75,140],[1.5,122],[1.5,125],[1,110],[1.95,120],[1.8,165],[2,120],[3.25,171],[3.9,160],[2.15,144],[2.5,150],[1.95,149],[1,120],[3.95,150],[2.75,149],[3.5,155]])\n\n# correlation using the Pearson correlation formula\ntotal = len(gpa_days)\nsum_x = np.sum(gpa_days[:,0])\nsum_y = np.sum(gpa_days[:,1])\nsum_xx = np.sum(gpa_days[:,0]**2)\nsum_yy = np.sum(gpa_days[:,1]**2)\nsum_xy = np.sum(gpa_days[:,1]*gpa_days[:,0])\n\ncorrelation_p = (total*sum_xy - sum_x*sum_y)\/(math.sqrt(total*sum_xx - sum_x**2) * math.sqrt(total*sum_yy - sum_y**2))\n\nprint(\"correlation using formula:\",correlation_p)\n\nxy = [gpa_days[:,0],gpa_days[:,1]]\n\n# correlation using NumPy's corrcoef,\n# which returns the matrix of Pearson correlation coefficients\ncorrelation_matrix = np.corrcoef(xy)\nprint(\"correlation using numpy:\",correlation_matrix[0][1])\n\n\nfig, ax = plt.subplots(ncols=2, figsize=(15,5))\n\n\nsn.heatmap(correlation_matrix, color=\"k\", annot=True, ax=ax[1])\nax[1].set_title(\"correlation matrix\")\n\n# pass the data as keyword arguments and label the axes separately\nsn.scatterplot(x=gpa_days[:,0], y=gpa_days[:,1], ax=ax[0])\nax[0].set_xlabel(\"GPA\")\nax[0].set_ylabel(\"Number of days attended (days)\")\nax[0].set_title(\"Attendance vs GPA dataset\")\n\nplt.show()<\/code><\/pre>\n\n\n\n<\/p>\n\n\n\nOutput of the Python 
code<\/figcaption><\/figure>\n\n\n\nCorrelation Matrix<\/h2>\n\n\n\n When there are more than two variables and you want to understand how strongly each pair is correlated, a correlation matrix gives a single view of all the pairwise correlations. A correlation matrix is a table of correlation coefficients: each cell shows the correlation between the variable in its row and the variable in its column, so the diagonal is always 1 and the matrix is symmetric.<\/p>\n\n\n\nWine Dataset Sample: The data snapshot above is the result of chemical analysis of wines grown in the same region in Italy but derived from three different cultivars.<\/figcaption><\/figure>\n\n\n\nThe matrix below shows the correlation among the different constituents of wine in our dataset. <\/p>\n\n\n\nCorrelation matrix<\/figcaption><\/figure>\n\n\n\nFrom the correlation matrix above we can make the following observations:<\/p>\n\n\n\n
density<\/em> has a strong positive correlation with residual sugar,<\/em> whereas it has a strong negative correlation with alcohol<\/em>.<\/li>pH and fixed acidity have a negative correlation.<\/li> density and fixed acidity have a positive correlation.<\/li> citric acid and fixed acidity have a positive correlation.<\/li> citric acid and volatile acidity have a negative correlation.<\/li> free sulfur dioxide and total sulfur dioxide have a positive correlation.<\/li><\/ul>\n\n\n\nCode for the above analysis:<\/p>\n\n\n\n
import pandas as pd\nimport seaborn as sn\nimport matplotlib.pyplot as plt\n\n# load the white wine quality dataset (semicolon-separated CSV)\ndf = pd.read_csv('https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/wine-quality\/winequality-white.csv', sep=';')\nprint(df.head())\n\n# plot the pairwise correlation matrix as an annotated heatmap\nplt.figure(figsize=(20,20))\nplt.title(\"Correlation Matrix Wine Dataset\")\nsn.heatmap(df.corr(), color=\"k\", annot=True, cmap=\"YlGnBu\")\nplt.show()<\/code><\/pre>\n\n\n\n
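The observations above were read off the heatmap by eye. With many variables it can help to list the strongly correlated pairs programmatically. The following is a minimal sketch of one way to do that: the helper `strongest_pairs`, its `threshold` parameter, and the synthetic `demo` frame are all assumptions made for illustration and are not part of the original analysis. In practice you would pass in the wine `df` loaded above.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, threshold=0.5):
    """Return variable pairs whose absolute correlation is at least `threshold`.

    Hypothetical helper for illustration; the threshold is an arbitrary choice.
    """
    corr = df.corr()
    # Keep only the cells above the diagonal so each pair appears once
    # and the trivial self-correlations (always 1) are dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    # Strongest relationships first, regardless of sign
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Synthetic data for illustration only (NOT the wine dataset)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # strongly positive with a
    "c": -a + rng.normal(scale=0.1, size=200),     # strongly negative with a
    "d": rng.normal(size=200),                     # unrelated noise
})

result = strongest_pairs(demo)
print(result)
```

Each row of the result is a pair of column names with its correlation coefficient, so positive and negative relationships like those listed above can be read directly from the sign.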