{"id":801,"date":"2018-08-28T14:46:42","date_gmt":"2018-08-28T14:46:42","guid":{"rendered":"http:\/\/muthu.co\/?p=801"},"modified":"2021-05-24T03:33:49","modified_gmt":"2021-05-24T03:33:49","slug":"mathematics-of-principal-component-analysis","status":"publish","type":"post","link":"http:\/\/write.muthu.co\/mathematics-of-principal-component-analysis\/","title":{"rendered":"Mathematics of Principal component analysis"},"content":{"rendered":"\n

Principal component analysis is a method used to reduce the number of dimensions in a dataset without losing much information. It's used in many fields such as face recognition and image compression, and it is a common technique for finding patterns in data and for visualizing higher-dimensional data. PCA is all about geometrically projecting the data onto lower dimensions called principal components (PCs). How important PCA is to the machine learning and AI community can only be understood by searching for the term "Principal Component Analysis" on Google Scholar. I have added a snapshot of my search result (28-Aug-2018).

[Figure: Google Scholar search results for "Principal Component Analysis"]

Before diving into the mathematics of PCA, let's compress a sample image using the standard PCA implementation provided by sklearn. This will give readers a head start on the power of PCA. For our example I am using the image below.

[Figure: original bird.jpg image]


<pre><code>import matplotlib.image as img

img_data = img.imread('bird.jpg')
print(img_data.shape)
# (467, 700, 3)</code></pre>

As you can see in the output, the image is 467 × 700 pixels with 3 color channels. Let me reduce the number of components to 400, 300, 200, 100, 50 and 10, and then restore the image back to its original dimensions. Below is the result of my transformations.

[Figure: image restored from 400, 300, 200, 100, 50 and 10 principal components]

As you can see in the above images, we haven't lost a lot of information until around 50 components. This is why PCA is one of the most important algorithms when it comes to image manipulation and compression.

The full code I used for generating the above analysis is given below:

<pre><code>import matplotlib.image as img
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import numpy as np

original_image = img.imread('bird.jpg')
print(original_image.shape)
print(original_image[0])

plt.axis('off')
plt.imshow(original_image)

# flatten each row of pixels into a single feature vector: (467, 700, 3) -> (467, 2100)
img_reshaped = np.reshape(original_image,
                          (np.size(original_image, 0),
                           np.size(original_image, 1) * np.size(original_image, 2)))
print(img_reshaped.shape)

subplot_index = 1
for n_components in [400, 300, 200, 100, 50, 10]:
    plt.subplot(3, 2, subplot_index)
    subplot_index += 1
    ipca = PCA(n_components).fit(img_reshaped)
    transf_img = ipca.transform(img_reshaped)
    # restore the image from the subspace
    image_restored = ipca.inverse_transform(transf_img)
    # reshape the image to the original array size
    image_restored = np.reshape(image_restored,
                                (np.size(original_image, 0),
                                 np.size(original_image, 1),
                                 np.size(original_image, 2)))
    image_restored = image_restored.astype(np.uint8)
    plt.axis('off')
    plt.title('n_components: ' + str(n_components))
    plt.imshow(image_restored)

plt.show()</code></pre>

Now, let's understand the mathematics behind PCA. First, I will attempt to give some elementary background mathematical knowledge required to understand the process of PCA. You can skip the sections you are already familiar with.

<h2>Standard Deviation</h2>

The Standard Deviation (SD) of a data set is a measure of how spread out the data is. A low Standard Deviation indicates that the data is clustered close to the mean, whereas a high value means the data is spread further away from the mean. The formula that gives the SD is:

$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}}$$

where $x_i$ are the observed values in our sample dataset, $\bar{x}$ is the mean of the samples, and $N$ is the number of observations. The graph below shows the distribution of IQ scores. It is evident from the graph that the majority of people have an IQ between 85 and 115: roughly 68% of the population falls within one standard deviation of the mean.

[Figure: normal distribution of IQ scores]
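To make the formula concrete, here is a minimal sketch (the IQ values are made-up sample numbers, not data from the graph) computing the sample standard deviation both directly from the formula and with NumPy's np.std using ddof=1, which gives the N − 1 denominator:

<pre><code>import numpy as np

# hypothetical sample of IQ scores, used only for illustration
iq_scores = np.array([88, 95, 102, 110, 97, 115, 85, 100, 108, 92])

# standard deviation straight from the formula: sqrt(sum((x - mean)^2) / (N - 1))
mean = iq_scores.mean()
sd_manual = np.sqrt(np.sum((iq_scores - mean) ** 2) / (len(iq_scores) - 1))

# NumPy equivalent: ddof=1 switches the denominator from N to N - 1
sd_numpy = np.std(iq_scores, ddof=1)

print(sd_manual, sd_numpy)  # the two values are identical</code></pre>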

<h2>Variance</h2>

Variance is another measure of the spread of data in a data set. In fact, it is simply the standard deviation squared. The formula is:

$$\mathrm{var}(X) = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}$$
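As a quick check (a minimal sketch reusing the same made-up IQ sample as above), the variance from the formula matches NumPy's np.var with ddof=1 and is exactly the standard deviation squared:

<pre><code>import numpy as np

iq_scores = np.array([88, 95, 102, 110, 97, 115, 85, 100, 108, 92])  # made-up sample

# variance straight from the formula: sum((x - mean)^2) / (N - 1)
var_manual = np.sum((iq_scores - iq_scores.mean()) ** 2) / (len(iq_scores) - 1)

# NumPy equivalent, and the relation var = std^2
var_numpy = np.var(iq_scores, ddof=1)
print(var_manual, var_numpy, np.std(iq_scores, ddof=1) ** 2)  # all three match</code></pre>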

<h2>Covariance</h2>

We use variance and standard deviation only when dealing with 1-dimensional data, but for two dimensions we use covariance. If you think of a stock price as 1-dimensional data moving with time on the x-axis, then we can compare how 2 stocks move together using covariance. Covariance is always measured between 2 dimensions. If you calculate the covariance between one dimension and itself, you get the variance. So, if you had a 3-dimensional data set (x, y, z), then you could measure the covariance between the x and y dimensions, the x and z dimensions, and the y and z dimensions. Measuring the covariance between x and x, or y and y, or z and z would give you the variance of the x, y, and z dimensions.

The formula for variance can also be written as:

$$\mathrm{var}(X) = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(x_i - \bar{x})}{N - 1}$$

When dealing with more than one variable, this generalizes to the covariance:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Let's take for example the movement of two stocks that I think may be correlated, HP and Dell, and try to find the covariance. Below is a table which shows their closing prices for one month.

[Table: one month of daily closing prices for HP and Dell; the values appear in the code below]

We will use this data to find whether the covariance is positive or negative. If the value is positive, it indicates that both dimensions are increasing together. If the value is negative, then as one dimension increases, the other decreases.

<pre><code>import numpy as np

price_HP = [15.59, 15.45, 15.44, 15.45, 15.55, 15.98, 16.01, 16.129999, 16.16,
            16.110001, 15.85, 15.66, 15.99, 16.030001, 16.27, 16.67, 16.719999,
            15.78, 15.82, 16.1, 16.200001]
price_DELL = [93.669998, 92.709999, 92.519997, 92.720001, 93.040001, 93.089996,
              93.440002, 93.699997, 94.150002, 94.730003, 94.529999, 94,
              94.860001, 94.68, 95.580002, 95.010002, 95.029999, 95.099998,
              95.730003, 95.309998, 95.339996]

mean_HP = np.mean(price_HP)
mean_DELL = np.mean(price_DELL)

# covariance computed directly from the formula
total = 0
for i in range(len(price_HP)):
    total += (price_HP[i] - mean_HP) * (price_DELL[i] - mean_DELL)
covariance = total / (len(price_HP) - 1)
print(covariance)

# covariance from the numpy library
print(np.cov(price_HP, price_DELL)[0][1])

# outputs:
# 0.23774758583349764
# 0.23774758583349764</code></pre>

As you can see, the covariance equals ~0.23, which is a positive number, so we can assume the two stocks are moving together.

<h2>Covariance Matrix</h2>

If your dataset has more than 2 dimensions, then it can have more than one covariance measurement. For example, if you have a dataset with 3 dimensions x, y and z, then the covariance matrix of this dataset is given by:

$$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$

Let's see what the covariance matrix looks like when we add another stock to the analysis. This time let's pick VMware price data.

<pre><code>price_VMWARE = [148.800003, 144.419998, 144.589996, 144.580002, 148.070007,
                149.449997, 151.380005, 153.059998, 152.839996, 153.759995,
                153.210007, 151.940002, 152.25, 152.039993, 151.929993,
                150.880005, 151.619995, 151.679993, 154.279999, 154.770004,
                151.369995]

# stack the three price series as rows so np.cov returns a 3 x 3 covariance matrix
print(np.cov([price_HP, price_DELL, price_VMWARE]))</code></pre>

The output looks like this:

[Output: the 3 × 3 covariance matrix of the HP, Dell and VMware price series]

which is a representation of:

$$\begin{pmatrix} \mathrm{cov}(HP,HP) & \mathrm{cov}(HP,DELL) & \mathrm{cov}(HP,VMWARE) \\ \mathrm{cov}(DELL,HP) & \mathrm{cov}(DELL,DELL) & \mathrm{cov}(DELL,VMWARE) \\ \mathrm{cov}(VMWARE,HP) & \mathrm{cov}(VMWARE,DELL) & \mathrm{cov}(VMWARE,VMWARE) \end{pmatrix}$$
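As a quick check of the earlier claim that the covariance of a dimension with itself is its variance, here is a minimal sketch (reusing the price lists defined in the snippets above): the diagonal of the matrix returned by np.cov matches the sample variances of the three stocks.

<pre><code># reuses price_HP, price_DELL and price_VMWARE from the snippets above
cov_matrix = np.cov([price_HP, price_DELL, price_VMWARE])

# the diagonal holds cov(HP, HP), cov(DELL, DELL) and cov(VMWARE, VMWARE),
# i.e. the sample variances of each price series (np.cov divides by N - 1 by default)
print(np.allclose(np.diag(cov_matrix),
                  [np.var(price_HP, ddof=1),
                   np.var(price_DELL, ddof=1),
                   np.var(price_VMWARE, ddof=1)]))  # True</code></pre>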


<h2>Eigenvectors and Eigenvalues</h2>

Understanding these two properties is the most important part of understanding PCA. An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. In mathematical terms, when a matrix is multiplied by one of its eigenvectors, the result is that same eigenvector scaled by a scalar called the eigenvalue:

$$A\vec{v} = \lambda\vec{v}$$

The best explanation of eigenvectors and eigenvalues is given in the video below. I wish I had this video when I learnt about eigenvectors for the first time.
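Before getting to the video, here is a minimal sketch (the 2 × 2 matrix is an arbitrary example, not data from this post) showing how NumPy computes eigenvalues and eigenvectors and verifying that A·v = λ·v holds for every eigenpair:

<pre><code>import numpy as np

# arbitrary 2 x 2 matrix, used purely for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are the eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]   # i-th eigenvector (unit length)
    lam = eigenvalues[i]     # corresponding eigenvalue
    # applying A only scales the eigenvector; its direction does not change
    print(np.allclose(A @ v, lam * v))  # True for every eigenpair</code></pre>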

\n