Correlation is a measure of how two or more variables are related to one another, also referred to as linear dependence. Examples are everywhere: an increase in demand for a product drives up its price (the demand curve), traffic on roads peaks at certain times of the day, and the amount of rainfall correlates with the incidence of grass fires.
Causation
Correlation doesn't imply causation. Even though two variables show a linear dependence, one should not assume that one is affecting the other without proper hypothesis testing. Correlation gives you an exploratory overview of any dependence between variables in your dataset; causation can only be established after careful study. For example, women who are more educated tend to have fewer children, and women who are less educated tend to have more children; this is a general observation. If you look at the populations of developed and under-developed countries alongside their national education indices, the two appear to be correlated, but we cannot conclude that education makes people have fewer babies. So correlation is best used as a suggestion rather than a technique that gives definitive answers. It is often a preparatory piece of analysis that gives some clues to what the data might yield, to be followed with other techniques such as regression.
Positive and Negative Correlation
Positive Correlation
Two variables X and Y are positively correlated if high values of X go with high values of Y and low values of X go with low values of Y. For example:
- Height and Weight – taller people are generally heavier, but many shorter people are heavy too (correlation doesn't imply causation). The cause of this behaviour cannot be attributed to height alone.
Negative Correlation
Two variables are said to be negatively correlated if high values of X go with low values of Y and vice versa. For example:
- More educated women tend to have fewer children. This doesn't mean that more education causes women to have fewer children; the relationship is usually driven by many factors, which may not be the same in different countries. There are many socio-economic factors behind the strong correlation between education and fertility [4] [5] [6], and one article is not enough to cover the entire scope of this research.
No Correlation
Two variables X and Y have no correlation when there is no relation between them, i.e. a change in one variable is not accompanied by any consistent change in the other.
Identifying Correlation
One of the ways to identify correlation is to look for visual cues in scatter plots. An increasing trend line suggests a positive correlation, while a decreasing trend line suggests a negative correlation.
The above method works as a preliminary analysis, but to get a concrete measure we use a correlation coefficient, which quantifies the exact degree of correlation.
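For a quick visual check, here is a minimal sketch (using synthetic data and the same NumPy/Matplotlib stack used later in this article) that draws scatter plots for a positively correlated, a negatively correlated, and an uncorrelated pair of variables:

import numpy as np
import matplotlib.pyplot as plt

# synthetic data: one shared variable x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
noise = rng.normal(scale=0.5, size=200)

y_positive = x + noise              # increases with x
y_negative = -x + noise             # decreases with x
y_none = rng.normal(size=200)       # unrelated to x

fig, axes = plt.subplots(ncols=3, figsize=(15, 4))
for ax, y, title in zip(axes, [y_positive, y_negative, y_none],
                        ["positive correlation", "negative correlation", "no correlation"]):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
plt.show()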
Pearson's Correlation Coefficient
It gives an estimate of the correlation between two variables. For continuous variables, we usually use Pearson's correlation coefficient. It is defined as the covariance of the two variables divided by the product of their standard deviations. The value ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation); 0 indicates no correlation.
Given a pair of random variables \((X,Y) \), the Pearson correlation coefficient, denoted by \(\rho \), is:
\(\rho(X,Y) = \frac{cov(X,Y)}{\sigma(X)\sigma(Y)} \)
where:
\(cov(X,Y) \) is the covariance of X and Y
\(\sigma(X) \) is the standard deviation of X
\(\sigma(Y) \) is the standard deviation of Y
The formula for covariance is:
\(cov(X,Y) = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y}) \)
The standard deviations are given by:
\(\sigma(X) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i} - \overline{x})^2} \)
\(\sigma(Y) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i} - \overline{y})^2} \)
Substituting these into the definition, the \(\frac{1}{n} \) factors cancel, which gives us the Pearson correlation coefficient as:
\(\rho(X,Y) = \frac{\sum_{i=1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y})}{\sqrt{\sum_{i=1}^{n}(x_{i} - \overline{x})^2}\sqrt{\sum_{i=1}^{n}(y_{i} - \overline{y})^2}} \)
where n is the sample size. The formula can be rearranged into a more computation-friendly form by expanding the means:
\(\rho(X,Y) = \frac{n\sum xy - \sum x\sum y}{\sqrt{n\sum x^2 - (\sum x)^2} \sqrt{n\sum y^2 - (\sum y)^2} } \)
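To see why the two forms are equivalent, expand the centred sums using \(\overline{x} = \frac{1}{n}\sum x_{i} \) and \(\overline{y} = \frac{1}{n}\sum y_{i} \):
\(\sum_{i=1}^{n}(x_{i} - \overline{x})(y_{i} - \overline{y}) = \sum x_{i}y_{i} - n\overline{x}\,\overline{y} = \frac{1}{n}\left(n\sum xy - \sum x\sum y\right) \)
\(\sum_{i=1}^{n}(x_{i} - \overline{x})^2 = \sum x_{i}^2 - n\overline{x}^2 = \frac{1}{n}\left(n\sum x^2 - (\sum x)^2\right) \)
Substituting these expansions (and the analogous one for Y) into the previous formula, the \(\frac{1}{n} \) factors again cancel, giving the computational form above.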
The value of the correlation coefficient \(\rho \) always ranges from -1 to +1. The coefficient describes not only the magnitude of the correlation but also its direction. For example, \(\rho = +0.8 \) indicates a high degree of positive correlation: the sign of \(\rho \) is positive and its numerical value is close to 1. If \(\rho = -0.4 \), there is a low degree of negative correlation, because the sign of \(\rho \) is negative and its numerical value is less than 0.5.
Note:
As you might have guessed from the use of the mean in the formula, the correlation coefficient is sensitive to outliers. So, during exploratory data analysis, it is important to deal with any outliers in the dataset before computing the correlation.
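To see this sensitivity in action, here is a minimal sketch with made-up numbers: a single extreme point added to an otherwise almost perfectly correlated sample moves the coefficient dramatically.

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = 2 * x + np.array([0.1, -0.2, 0.05, 0.3, -0.1, 0.2, -0.3, 0.1])

# nearly perfect positive correlation
print("without outlier:", np.corrcoef(x, y)[0, 1])

# one extreme point pulls the coefficient far away from +1
x_out = np.append(x, 9.0)
y_out = np.append(y, -50.0)
print("with outlier:", np.corrcoef(x_out, y_out)[0, 1])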
Let's try to find the correlation coefficient on a sample dataset. A classic example is the correlation between students' GPA and their classroom attendance. Looking at the scatter plot trend line we can assume there is a positive correlation here, so let's find out the magnitude of that correlation.
The dataset contains the attendance and GPA of 75 students in a class, with the number of school days equal to 180. The full dataset is given below.
GPA (X) | Days Present (Y) | \(X^2 \) | \(Y^2 \) | \(XY \) |
---|---|---|---|---|
4 | 180 | 16 | 32400 | 720 |
2.5 | 150 | 6.25 | 22500 | 375 |
4 | 170 | 16 | 28900 | 680 |
3.9 | 180 | 15.21 | 32400 | 702 |
3.75 | 177 | 14.0625 | 31329 | 663.75 |
3.8 | 180 | 14.44 | 32400 | 684 |
2.9 | 140 | 8.41 | 19600 | 406 |
3.1 | 169 | 9.61 | 28561 | 523.9 |
3.25 | 168 | 10.5625 | 28224 | 546 |
3.4 | 152 | 11.56 | 23104 | 516.8 |
3.3 | 150 | 10.89 | 22500 | 495 |
3.9 | 170 | 15.21 | 28900 | 663 |
1.35 | 109 | 1.8225 | 11881 | 147.15 |
4 | 180 | 16 | 32400 | 720 |
1 | 108 | 1 | 11664 | 108 |
3.85 | 175 | 14.8225 | 30625 | 673.75 |
2.98 | 144 | 8.8804 | 20736 | 429.12 |
2.75 | 120 | 7.5625 | 14400 | 330 |
2.75 | 133 | 7.5625 | 17689 | 365.75 |
3.6 | 160 | 12.96 | 25600 | 576 |
3.5 | 160 | 12.25 | 25600 | 560 |
3.5 | 159 | 12.25 | 25281 | 556.5 |
3.5 | 165 | 12.25 | 27225 | 577.5 |
3.85 | 180 | 14.8225 | 32400 | 693 |
2.95 | 149 | 8.7025 | 22201 | 439.55 |
3.95 | 180 | 15.6025 | 32400 | 711 |
3.65 | 160 | 13.3225 | 25600 | 584 |
3.55 | 155 | 12.6025 | 24025 | 550.25 |
3.58 | 156 | 12.8164 | 24336 | 558.48 |
2.98 | 145 | 8.8804 | 21025 | 432.1 |
1.5 | 122 | 2.25 | 14884 | 183 |
1.75 | 131 | 3.0625 | 17161 | 229.25 |
2.2 | 156 | 4.84 | 24336 | 343.2 |
3 | 166 | 9 | 27556 | 498 |
3 | 170 | 9 | 28900 | 510 |
3 | 155 | 9 | 24025 | 465 |
3.15 | 158 | 9.9225 | 24964 | 497.7 |
3.9 | 170 | 15.21 | 28900 | 663 |
3.15 | 160 | 9.9225 | 25600 | 504 |
3.85 | 165 | 14.8225 | 27225 | 635.25 |
2.7 | 159 | 7.29 | 25281 | 429.3 |
1 | 119 | 1 | 14161 | 119 |
3.25 | 168 | 10.5625 | 28224 | 546 |
3.9 | 175 | 15.21 | 30625 | 682.5 |
2.8 | 161 | 7.84 | 25921 | 450.8 |
3.5 | 160 | 12.25 | 25600 | 560 |
3.4 | 160 | 11.56 | 25600 | 544 |
2.3 | 150 | 5.29 | 22500 | 345 |
2.5 | 140 | 6.25 | 19600 | 350 |
2.35 | 148 | 5.5225 | 21904 | 347.8 |
2.95 | 149 | 8.7025 | 22201 | 439.55 |
3.55 | 160 | 12.6025 | 25600 | 568 |
3.6 | 155 | 12.96 | 24025 | 558 |
3.3 | 166 | 10.89 | 27556 | 547.8 |
3.85 | 160 | 14.8225 | 25600 | 616 |
3.95 | 179 | 15.6025 | 32041 | 707.05 |
2.95 | 145 | 8.7025 | 21025 | 427.75 |
2 | 143 | 4 | 20449 | 286 |
2 | 145 | 4 | 21025 | 290 |
1.75 | 140 | 3.0625 | 19600 | 245 |
1.5 | 122 | 2.25 | 14884 | 183 |
1.5 | 125 | 2.25 | 15625 | 187.5 |
1 | 110 | 1 | 12100 | 110 |
1.95 | 120 | 3.8025 | 14400 | 234 |
1.8 | 165 | 3.24 | 27225 | 297 |
2 | 120 | 4 | 14400 | 240 |
3.25 | 171 | 10.5625 | 29241 | 555.75 |
3.9 | 160 | 15.21 | 25600 | 624 |
2.15 | 144 | 4.6225 | 20736 | 309.6 |
2.5 | 150 | 6.25 | 22500 | 375 |
1.95 | 149 | 3.8025 | 22201 | 290.55 |
1 | 120 | 1 | 14400 | 120 |
3.95 | 150 | 15.6025 | 22500 | 592.5 |
2.75 | 149 | 7.5625 | 22201 | 409.75 |
3.5 | 155 | 12.25 | 24025 | 542.5 |
\(\sum X = \) 219.89 | \(\sum Y = \)11469.0 | \(\sum X^2 = \) 700.8697 | \(\sum Y^2 = \)1780033.0 | \(\sum XY = \)34646.7 |
Substituting the above values in our correlation coefficient formula:
\(\rho(X,Y) = \frac{n\sum XY - \sum X\sum Y}{\sqrt{n\sum X^2 - (\sum X)^2} \sqrt{n\sum Y^2 - (\sum Y)^2} } \)
we get:
\(\rho(X,Y) = \frac{75 \times 34646.7 - 219.89 \times 11469.0}{\sqrt{75 \times 700.8697 - 219.89^2} \sqrt{75 \times 1780033.0 - 11469^2} } = 0.84 \)
This indicates a strong positive correlation.
Correlation using Python
There are many standard Python libraries that can be used to calculate correlation; I will use the well-known NumPy library. The code below shows the calculation for the above dataset using the formula as well as NumPy.
import numpy as np
import math
import seaborn as sn
import matplotlib.pyplot as plt
# setting seaborn as default chart
sn.set()
# dataset
gpa_days = np.array([[4,180],[2.5,150],[4,170],[3.9,180],[3.75,177],[3.8,180],[2.9,140],[3.1,169],[3.25,168],[3.4,152],[3.3,150],[3.9,170],[1.35,109],[4,180],[1,108],[3.85,175],[2.98,144],[2.75,120],[2.75,133],[3.6,160],[3.5,160],[3.5,159],[3.5,165],[3.85,180],[2.95,149],[3.95,180],[3.65,160],[3.55,155],[3.58,156],[2.98,145],[1.5,122],[1.75,131],[2.2,156],[3,166],[3,170],[3,155],[3.15,158],[3.9,170],[3.15,160],[3.85,165],[2.7,159],[1,119],[3.25,168],[3.9,175],[2.8,161],[3.5,160],[3.4,160],[2.3,150],[2.5,140],[2.35,148],[2.95,149],[3.55,160],[3.6,155],[3.3,166],[3.85,160],[3.95,179],[2.95,145],[2,143],[2,145],[1.75,140],[1.5,122],[1.5,125],[1,110],[1.95,120],[1.8,165],[2,120],[3.25,171],[3.9,160],[2.15,144],[2.5,150],[1.95,149],[1,120],[3.95,150],[2.75,149],[3.5,155]])
## finding correlation using the Pearson correlation formula
total = len(gpa_days)
sum_x = np.sum(gpa_days[:,0])
sum_y = np.sum(gpa_days[:,1])
sum_xx = np.sum(gpa_days[:,0]**2)
sum_yy = np.sum(gpa_days[:,1]**2)
sum_xy = np.sum(gpa_days[:,1]*gpa_days[:,0])
correlation_p = (total*sum_xy - sum_x*sum_y)/(math.sqrt(total*sum_xx - sum_x**2) * math.sqrt(total*sum_yy - sum_y**2))
print("correlation using formula:",correlation_p)
xy = [gpa_days[:,0],gpa_days[:,1]]
# correlation using the numpy standard library
# which internally uses pearsons correlation
correlation_matrix = np.corrcoef(xy)
print("correlation using numpy:",correlation_matrix[0][1])
fig, ax = plt.subplots(ncols=2, figsize=(15,5))
sn.heatmap(np.corrcoef(xy), annot=True, ax=ax[1])
ax[1].set_title("correlation matrix")
sn.scatterplot(x=gpa_days[:,0], y=gpa_days[:,1], ax=ax[0])
ax[0].set_xlabel("GPA")
ax[0].set_ylabel("Number of days attended (days)")
ax[0].set_title("Attendance vs GPA dataset")
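Both print statements report the Pearson coefficient for the same 75 data points, so they agree with the hand calculation above, giving a value of roughly 0.84.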
Correlation Matrix
When there are more than two variables and you want to understand how correlated all of them are, we use a correlation matrix, which gives a single view of all pairwise correlations. A correlation matrix is simply a table showing the correlation coefficients between your variables; each cell in the table shows the correlation between two variables.
The matrix below shows the correlations among the different constituents of wine in our dataset.
From the correlation matrix above we can make the following observations:
- density has a strong positive correlation with residual sugar, whereas it has a strong negative correlation with alcohol.
- pH & fixed acidity has a negative correlation.
- density & fixed acidity has a positive correlation.
- citric acid & fixed acidity has a positive correlation.
- citric acid & volatile acidity has a negative correlation.
- free sulfur dioxide & total sulfur dioxide has a positive correlation.
Code for the above analysis
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
# exploring the white wine quality dataset from the UCI repository
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
print(df.head())
plt.figure(figsize=(20,20))
plt.title("Correlation Matrix Wine Dataset")
sn.heatmap(df.corr(), annot=True, cmap="YlGnBu")
Key Ideas
- The correlation coefficient measures the extent to which two variables are related to each other.
- Scatter plots are used to get a visual understanding of correlation.
- A correlation matrix can be used to get a snapshot of the relationships among more than two variables in a tabular format.
- The correlation coefficient is a standardized metric that ranges from -1 to +1. Positive values indicate a positive correlation, negative values indicate a negative correlation, and 0 indicates no correlation.
Data Sources:
[1] Height and Weight data source – http://www.math.utah.edu/~korevaar/2270fall09/project2/htwts09.pdf
[2] Wine Dataset – https://archive.ics.uci.edu/ml/datasets/wine
References:
[1] A Simple Study on Weight and Height of Students https://www.hindawi.com/journals/tswj/2017/7258607/
[2] https://blogs.worldbank.org/health/female-education-and-childbearing-closer-look-data
[3] https://wol.iza.org/uploads/articles/228/pdfs/female-education-and-its-impact-on-fertility.pdf
[4] Becker, G S and G H Lewis (1973), “On the interaction between the quantity and quality of children”, Journal of Political Economy 81: S279–S288.
[5] Galor, O and D N Weil (1996), “The gender gap, fertility, and growth”, American Economic Review 86(3): 374–387.
[6] Galor, O and D N Weil (2000), “Population, technology, and growth: From Malthusian stagnation to the demographic transition and beyond”, American Economic Review 90(4): 806–828.
[7] https://en.wikipedia.org/wiki/Pearson_correlation_coefficient