Understanding Correlations and Correlation Matrix

Correlation is a measure of how two or more variables are related to one another; in the sense used in this article it describes their linear dependence. Examples are everywhere: the price of a product moves with the demand for it (the demand curve), traffic on the roads varies with the time of day, and the amount of rain correlates with the number of grass fires.

Causation

Correlation doesn’t imply causation. Even when two variables have a linear dependence, one should not assume that one is affecting the other without proper hypothesis testing. Correlation gives you an exploratory overview of any dependence between variables in your dataset; causation can only be established after careful study. For example, it is a general observation that women who are more educated tend to have fewer children, and women who are less educated tend to have more. If you compare the populations of developed and under-developed countries with their national education indices, the two seem to be correlated, but we can’t conclude that education causes people to have fewer babies. So, correlation is best used as a suggestion rather than a technique that gives definitive answers. It is often a preparatory piece of analysis that gives some clues to what the data might yield, to be followed by other techniques like regression.
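To make the point concrete, here is a minimal simulation sketch (synthetic data, not taken from any study). A hidden third variable `z` drives both `x` and `y`, so the two end up correlated even though neither influences the other. The variable names, the noise levels and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 1000

z = rng.normal(size=n)                  # hidden common factor (hypothetical)
x = z + rng.normal(scale=0.5, size=n)   # x depends on z only
y = z + rng.normal(scale=0.5, size=n)   # y depends on z only, never on x

r_xy = np.corrcoef(x, y)[0, 1]
# once the common factor is subtracted, only independent noise remains
r_residual = np.corrcoef(x - z, y - z)[0, 1]

print("correlation between x and y:", round(r_xy, 2))
print("correlation after removing z:", round(r_residual, 2))
```

The first number is large and the second is close to zero, which is the pattern you would expect when a shared factor, rather than a direct effect, produces the correlation.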

Positive and Negative Correlation

Positive Correlation

Two variables X and Y are positively correlated if high values of X go with high values of Y and low values of X go with low values of Y. For example:

  • Height and Weight – Taller people are generally heavier, but many shorter people are heavy too. The relationship is a tendency rather than a rule, and a person's weight cannot be attributed to height alone.
Weight and Height Correlation [1]

Negative Correlation

Two variables are said to be negatively correlated if high values of X go with low values of Y and vice versa. For example:

  • More educated women tend to have fewer children. This doesn’t mean that more education causes women to have fewer children; the pattern is usually driven by many factors, which may not be the same in different countries. Many socio-economic factors have been studied in connection with the strong negative correlation between education and fertility [4] [5] [6], and one article will not be enough to cover the entire scope of this research.

No Correlation

X and Y have no correlation when there is no relation between them, i.e. a change in one variable is not accompanied by any consistent change in the other.

Identifying Correlation

One of the ways to identify correlation is to look for visual cues in scatter plots. An increasing trend line indicates a positive correlation, while a decreasing trend line may indicate a negative correlation.
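As a rough illustration of reading the direction of a trend line, the sketch below fits a straight line with `np.polyfit` to three synthetic datasets and looks at the sign of the slope. The data, the seed and the noise level are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility
x = np.linspace(0, 10, 200)
noise = rng.normal(scale=1.0, size=x.size)

# three made-up relationships between x and y
datasets = {
    "positive": 2.0 * x + noise,
    "negative": -2.0 * x + noise,
    "none": rng.normal(scale=1.0, size=x.size),
}

slopes = {}
for name, y in datasets.items():
    # degree-1 fit returns the slope and intercept of the trend line
    slope, intercept = np.polyfit(x, y, deg=1)
    slopes[name] = slope
    print(f"{name:>8}: trend-line slope = {slope:+.2f}")
```

A clearly positive slope, a clearly negative slope and a slope near zero correspond to the three cases described above.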

Visual inspection works as a preliminary analysis, but to get a concrete measure we use a correlation coefficient, which gives the exact degree of correlation.

Pearson’s Correlation coefficient

The correlation coefficient gives an estimate of the correlation between two variables. For continuous variables, we usually use Pearson’s correlation coefficient. It is the covariance of the two variables divided by the product of their standard deviations. Its values range from -1 (perfect negative correlation) to +1 (perfect positive correlation); 0 indicates no linear correlation.

Given a pair of random variables \((X,Y) \), the formula for Pearson Correlation Coefficient denoted by \(ρ \) is:

\(ρ = \frac{cov(X,Y)}{σ(X)σ(Y)} \)

where:

\(cov \) is covariance
\(σ(X) \) is the standard deviation of X
\(σ(Y) \) is the standard deviation of Y
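The definition can be sketched directly in code. The function below is an illustrative helper (its name and the small sample data are assumptions, not part of any library): it divides the covariance by the product of the standard deviations and compares the result with `np.corrcoef`.

```python
import numpy as np

def pearson_from_definition(x, y):
    """Pearson's coefficient as covariance / (std of x * std of y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # population covariance: mean of the products of deviations
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    # ndarray.std() defaults to the population standard deviation (ddof=0)
    return cov_xy / (x.std() * y.std())

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(pearson_from_definition(x, y))
print(np.corrcoef(x, y)[0, 1])
```

Both lines print the same value (about 0.77 for this sample), since `np.corrcoef` computes the same ratio.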

The formula for covariance is:

\(cov(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y}) \)

Standard deviation is given by

\(σ(X) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \overline{x})^2} \)
\(σ(Y) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \overline{y})^2} \)

The factors of \(\frac{1}{n} \) cancel in the ratio, which gives us the Pearson correlation coefficient as:

\(\rho(X,Y) = \frac{\sum_{i=1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y})}{\sqrt{\sum_{i=1}^{n}(x_{i} - \overline{x})^2}\sqrt{\sum_{i=1}^{n}(y_{i} - \overline{y})^2}} \)

where n is the sample size. The formula can be rearranged into a form that is easier to compute by substituting the definition of the mean:

\(\rho(X,Y) = \frac{n\sum xy - \sum x\sum y}{\sqrt{n\sum x^2 - (\sum x)^2} \sqrt{n\sum y^2 - (\sum y)^2} } \)
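One way to see where the rearranged form comes from, sketched with the same symbols and using \(\overline{x} = \frac{1}{n}\sum x \) and \(\overline{y} = \frac{1}{n}\sum y \):

\(\sum_{i=1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y}) = \sum xy - n\,\overline{x}\,\overline{y} = \frac{n\sum xy - \sum x\sum y}{n} \)

\(\sum_{i=1}^{n}(x_{i} - \overline{x})^2 = \sum x^2 - n\,\overline{x}^2 = \frac{n\sum x^2 - (\sum x)^2}{n} \)

The sum over y is handled in the same way. The numerator then carries a factor of \(\frac{1}{n} \) and each of the two square roots in the denominator carries \(\frac{1}{\sqrt{n}} \), so these factors cancel and only the plain sums remain.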

The value of the coefficient of correlation ρ always ranges from -1 to +1. The correlation coefficient describes not only the magnitude of the correlation but also its direction. For example, ρ=+0.8 indicates that the correlation is positive, because the sign of ρ is plus, and that the degree of correlation is high, because the numerical value 0.8 is close to 1. If ρ=-0.4, there is a low degree of negative correlation, because the sign of ρ is negative and its numerical value is less than 0.5.
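The reading of sign and magnitude can be written as a small helper. This is only a sketch: the function name is hypothetical, and the 0.5 cutoff is the rule of thumb used in the paragraph above, not a universal standard.

```python
def describe_correlation(r, strong_cutoff=0.5):
    """Return a short verbal description of a correlation coefficient."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if r == 0:
        return "no linear correlation"
    # the sign gives the direction, the absolute value gives the degree
    direction = "positive" if r > 0 else "negative"
    degree = "high" if abs(r) >= strong_cutoff else "low"
    return f"{degree} degree of {direction} correlation"

print(describe_correlation(0.8))   # high degree of positive correlation
print(describe_correlation(-0.4))  # low degree of negative correlation
```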

Note:

The correlation coefficient is sensitive to outliers, as you might have guessed from the use of the mean in the formula. So, in exploratory data analysis, it is important to identify any outliers in the dataset, and remove them where appropriate, before finding the correlation.
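A toy example shows how strong this effect can be. The numbers below are made up: ten points lie exactly on a straight line, and then a single extreme point is appended.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)   # 1, 2, ..., 10
y = 2 * x + 1                       # perfectly linear, so r is 1

r_clean = np.corrcoef(x, y)[0, 1]

# append one made-up extreme point far away from the line
x_out = np.append(x, 30.0)
y_out = np.append(y, 0.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print("without outlier:", round(r_clean, 2))
print("with one outlier:", round(r_outlier, 2))
```

In this toy case one point is enough to push the coefficient from 1 to below zero, which is why outliers deserve a look before the coefficient is interpreted.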

Let’s try to find the correlation coefficient on a sample dataset. A classic example is the correlation between students’ GPA and their attendance in the classroom. Looking at the scatter plot trend line we can assume there is a positive correlation here, so let’s try and find out the magnitude of the correlation.

The dataset contains the attendance and GPA of 75 students in a class, with the number of school days equal to 180. The full dataset is given below.

GPA (X) Days Present (Y) \(X^2 \) \(Y^2 \) \(XY \)
4 180 16 32400 720
2.5 150 6.25 22500 375
4 170 16 28900 680
3.9 180 15.21 32400 702
3.75 177 14.0625 31329 663.75
3.8 180 14.44 32400 684
2.9 140 8.41 19600 406
3.1 169 9.61 28561 523.9
3.25 168 10.5625 28224 546
3.4 152 11.56 23104 516.8
3.3 150 10.89 22500 495
3.9 170 15.21 28900 663
1.35 109 1.8225 11881 147.15
4 180 16 32400 720
1 108 1 11664 108
3.85 175 14.8225 30625 673.75
2.98 144 8.8804 20736 429.12
2.75 120 7.5625 14400 330
2.75 133 7.5625 17689 365.75
3.6 160 12.96 25600 576
3.5 160 12.25 25600 560
3.5 159 12.25 25281 556.5
3.5 165 12.25 27225 577.5
3.85 180 14.8225 32400 693
2.95 149 8.7025 22201 439.55
3.95 180 15.6025 32400 711
3.65 160 13.3225 25600 584
3.55 155 12.6025 24025 550.25
3.58 156 12.8164 24336 558.48
2.98 145 8.8804 21025 432.1
1.5 122 2.25 14884 183
1.75 131 3.0625 17161 229.25
2.2 156 4.84 24336 343.2
3 166 9 27556 498
3 170 9 28900 510
3 155 9 24025 465
3.15 158 9.9225 24964 497.7
3.9 170 15.21 28900 663
3.15 160 9.9225 25600 504
3.85 165 14.8225 27225 635.25
2.7 159 7.29 25281 429.3
1 119 1 14161 119
3.25 168 10.5625 28224 546
3.9 175 15.21 30625 682.5
2.8 161 7.84 25921 450.8
3.5 160 12.25 25600 560
3.4 160 11.56 25600 544
2.3 150 5.29 22500 345
2.5 140 6.25 19600 350
2.35 148 5.5225 21904 347.8
2.95 149 8.7025 22201 439.55
3.55 160 12.6025 25600 568
3.6 155 12.96 24025 558
3.3 166 10.89 27556 547.8
3.85 160 14.8225 25600 616
3.95 179 15.6025 32041 707.05
2.95 145 8.7025 21025 427.75
2 143 4 20449 286
2 145 4 21025 290
1.75 140 3.0625 19600 245
1.5 122 2.25 14884 183
1.5 125 2.25 15625 187.5
1 110 1 12100 110
1.95 120 3.8025 14400 234
1.8 165 3.24 27225 297
2 120 4 14400 240
3.25 171 10.5625 29241 555.75
3.9 160 15.21 25600 624
2.15 144 4.6225 20736 309.6
2.5 150 6.25 22500 375
1.95 149 3.8025 22201 290.55
1 120 1 14400 120
3.95 150 15.6025 22500 592.5
2.75 149 7.5625 22201 409.75
3.5 155 12.25 24025 542.5
Totals: \(\sum X = 219.89 \), \(\sum Y = 11469.0 \), \(\sum X^2 = 700.8697 \), \(\sum Y^2 = 1780033.0 \), \(\sum XY = 34646.7 \)

Substituting the above values in our correlation coefficient formula:

\(\rho(X,Y) = \frac{n\sum XY - \sum X\sum Y}{\sqrt{n\sum X^2 - (\sum X)^2} \sqrt{n\sum Y^2 - (\sum Y)^2} } \)

we get:

\(\rho(X,Y) = \frac{75 \times 34646.7 - 219.89 \times 11469}{\sqrt{75 \times 700.8697 - 219.89^2} \sqrt{75 \times 1780033.0 - 11469^2} } = 0.84 \)

indicating a strong positive correlation.
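The substitution can be checked by plugging the column totals from the table into the formula. The snippet below is a plain arithmetic check that uses only those totals; the variable names are chosen here for readability.

```python
import math

# totals taken from the last row of the table
n = 75
sum_x, sum_y = 219.89, 11469.0
sum_xx, sum_yy = 700.8697, 1780033.0
sum_xy = 34646.7

numerator = n * sum_xy - sum_x * sum_y
denominator = (math.sqrt(n * sum_xx - sum_x ** 2)
               * math.sqrt(n * sum_yy - sum_y ** 2))
r = numerator / denominator
print(round(r, 2))  # 0.84
```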

Correlation using Python

There are many standard Python libraries that can be used to calculate correlation; I will use the well-known numpy library. The code below shows the calculation for the above dataset using the formula as well as numpy.

import numpy as np
import math
import seaborn as sn
import matplotlib.pyplot as plt

# setting seaborn as default chart
sn.set()

# dataset
gpa_days = np.array([[4,180],[2.5,150],[4,170],[3.9,180],[3.75,177],[3.8,180],[2.9,140],[3.1,169],[3.25,168],[3.4,152],[3.3,150],[3.9,170],[1.35,109],[4,180],[1,108],[3.85,175],[2.98,144],[2.75,120],[2.75,133],[3.6,160],[3.5,160],[3.5,159],[3.5,165],[3.85,180],[2.95,149],[3.95,180],[3.65,160],[3.55,155],[3.58,156],[2.98,145],[1.5,122],[1.75,131],[2.2,156],[3,166],[3,170],[3,155],[3.15,158],[3.9,170],[3.15,160],[3.85,165],[2.7,159],[1,119],[3.25,168],[3.9,175],[2.8,161],[3.5,160],[3.4,160],[2.3,150],[2.5,140],[2.35,148],[2.95,149],[3.55,160],[3.6,155],[3.3,166],[3.85,160],[3.95,179],[2.95,145],[2,143],[2,145],[1.75,140],[1.5,122],[1.5,125],[1,110],[1.95,120],[1.8,165],[2,120],[3.25,171],[3.9,160],[2.15,144],[2.5,150],[1.95,149],[1,120],[3.95,150],[2.75,149],[3.5,155]])

# finding correlation using the Pearson correlation formula
total = len(gpa_days)
sum_x = np.sum(gpa_days[:,0])
sum_y = np.sum(gpa_days[:,1])
sum_xx = np.sum(gpa_days[:,0]**2)
sum_yy = np.sum(gpa_days[:,1]**2)
sum_xy = np.sum(gpa_days[:,1]*gpa_days[:,0])

correlation_p = (total*sum_xy - sum_x*sum_y)/(math.sqrt(total*sum_xx - sum_x**2) * math.sqrt(total*sum_yy - sum_y**2))

print("correlation using formula:",correlation_p)

xy = [gpa_days[:,0],gpa_days[:,1]]

# correlation using the numpy standard library 
# which internally uses pearsons correlation
correlation_matrix = np.corrcoef(xy)
print("correlation using numpy:",correlation_matrix[0][1])


fig, ax = plt.subplots(ncols=2, figsize=(15,5))


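# heatmap of the 2x2 correlation matrix
sn.heatmap(correlation_matrix, annot=True, ax=ax[1])
ax[1].set_title("correlation matrix")

# scatter plot of the raw data; axis labels are set on the axes object
sn.scatterplot(x=gpa_days[:,0], y=gpa_days[:,1], ax=ax[0])
ax[0].set_xlabel("GPA")
ax[0].set_ylabel("Number of days attended (days)")
ax[0].set_title("Attendance vs GPA dataset")

plt.show()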

Output of the Python code: scatter plot of the student data and its correlation matrix

Correlation Matrix

When there are more than 2 variables and you want to understand how correlated all the variables are, we use a correlation matrix that gives us a single view of all correlations. A correlation matrix is nothing but a table showing correlation coefficients among your variables. Each cell in the table shows the correlation between two variables.
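As a small sketch of what such a table looks like, the example below builds a correlation matrix for three variables with pandas. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# made-up data for three variables
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 80],
    "hours_gaming": [9, 7, 8, 4, 3, 2],
})

# DataFrame.corr() computes pairwise Pearson correlations by default
corr = df.corr()
print(corr.round(2))
```

The diagonal is always 1, because every variable is perfectly correlated with itself, and the matrix is symmetric, because the correlation of X with Y equals the correlation of Y with X.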

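Wine Dataset Sample: The data used here is the white wine subset of the UCI Wine Quality dataset. Each row holds the results of physicochemical tests on one wine (acidity, residual sugar, sulfur dioxide, density, pH, alcohol and so on) together with a quality score.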

The matrix below shows the correlation among the different constituents of wine in our dataset.

Correlation matrix

From the correlation matrix above we can make the following observations:

  • density has a strong positive correlation with residual sugar, whereas it has a strong negative correlation with alcohol.
  • pH & fixed acidity have a negative correlation.
  • density & fixed acidity have a positive correlation.
  • citric acid & fixed acidity have a positive correlation.
  • citric acid & volatile acidity have a negative correlation.
  • free sulfur dioxide & total sulfur dioxide have a positive correlation.
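Observations like these can also be pulled out of a correlation matrix programmatically. The helper below is a hypothetical sketch, shown on a tiny made-up table rather than the wine data: it ranks variable pairs by the absolute value of their correlation.

```python
from itertools import combinations
import pandas as pd

def strongest_pairs(df, k=3):
    """Return the k variable pairs with the largest absolute correlation."""
    corr = df.corr()
    # each unordered pair once; a variable paired with itself is skipped
    pairs = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs.sort(key=lambda item: abs(item[2]), reverse=True)
    return pairs[:k]

# made-up data for illustration
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],   # exactly 2 * a
    "c": [6, 5, 4, 3, 2, 2],     # falls as a rises
    "d": [3, 1, 4, 1, 5, 9],
})
for first, second, r in strongest_pairs(toy):
    print(f"{first} & {second}: {r:+.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top of the list alongside strong positive ones.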

Code for the above analysis

import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# exploring the white wine quality dataset (semicolon-separated CSV)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
print(df.head())

# heatmap of the pairwise Pearson correlations between all columns
plt.figure(figsize=(20,20))
plt.title("Correlation Matrix Wine Dataset")
sn.heatmap(df.corr(), annot=True, cmap="YlGnBu")
plt.show()

Key Ideas

  • The correlation coefficient measures the extent to which two variables are related to each other.
  • Scatter plots are used to get a visual understanding of correlation.
  • A correlation matrix can be used to get a snapshot of the relationships between more than two variables in a tabular format.
  • The correlation coefficient is a standardized metric that ranges from -1 to +1. Positive values indicate a positive correlation, negative values indicate a negative correlation, and 0 indicates no correlation.

Data Sources:

[1] Height and Weight datasource – http://www.math.utah.edu/~korevaar/2270fall09/project2/htwts09.pdf
[2] Wine Dataset – https://archive.ics.uci.edu/ml/datasets/wine

References:

[1] A Simple Study on Weight and Height of Students https://www.hindawi.com/journals/tswj/2017/7258607/
[2] https://blogs.worldbank.org/health/female-education-and-childbearing-closer-look-data
[3] https://wol.iza.org/uploads/articles/228/pdfs/female-education-and-its-impact-on-fertility.pdf
[4] Becker, G S and G H Lewis (1973), “On the interaction between the quantity and quality of children”, Journal of Political Economy 81: S279–S288.
[5] Galor, O and D N Weil (1996), “The gender gap, fertility, and growth”, American Economic Review 86(3): 374–387.
[6] Galor, O and D N Weil (2000), “Population, technology, and growth: From Malthusian stagnation to the demographic transition and beyond”, American Economic Review 90(4): 806–828.
[7] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient