{"id":769,"date":"2018-07-07T08:10:09","date_gmt":"2018-07-07T08:10:09","guid":{"rendered":"http:\/\/muthu.co\/?p=769"},"modified":"2021-05-24T03:35:55","modified_gmt":"2021-05-24T03:35:55","slug":"mathematics-behind-k-mean-clustering-algorithm","status":"publish","type":"post","link":"http:\/\/write.muthu.co\/mathematics-behind-k-mean-clustering-algorithm\/","title":{"rendered":"Mathematics behind K-Mean Clustering algorithm"},"content":{"rendered":"\n<p>K-Means is one of the simplest unsupervised clustering algorithms. It partitions a dataset into K clusters by iteratively assigning each data point to the cluster whose centroid is nearest and then recomputing the centroids. The result of the K-Means algorithm is:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>K cluster centroids<\/li><li>Every data point assigned to one of the K clusters<\/li><\/ol>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105.png\"><img loading=\"lazy\" decoding=\"async\" width=\"365\" height=\"249\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105.png\" alt=\"\" class=\"wp-image-780\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105.png 365w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105-300x205.png 300w\" sizes=\"auto, (max-width: 365px) 100vw, 365px\" \/><\/a><figcaption>Data points clustered into 4 clusters, with centroids marked<\/figcaption><\/figure><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">Applications:<\/h4>\n\n\n\n<p>K-Means can be used for any type of grouping&nbsp;where the data has not been explicitly labeled. 
Some real-world examples are given below:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S1877050915014143\">Image Segmentation<\/a><\/li><li><a href=\"https:\/\/www.ijser.org\/paper\/Chromosome-Segmentation-Using-K-Means-Clustering.html\">Chromosome segmentation<\/a><\/li><li><a href=\"https:\/\/ieeexplore.ieee.org\/document\/5591774\/\">News Comments Clustering<\/a><\/li><li><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S1877050915035929\">Grouping inventory by sales activity<\/a><\/li><li><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S1877050914013155\">Clustering animals<\/a><\/li><li><a href=\"https:\/\/www.computerweekly.com\/tip\/Botnet-detection-through-DNS-behavior-and-clustering-analysis\">Bots and Anomaly Detection<\/a><\/li><\/ul>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Outline of the algorithm:<\/strong><\/h4>\n\n\n\n<p>Assuming we have input data points&nbsp;<i>x<\/i><sub>1<\/sub>, <i>x<\/i><sub>2<\/sub>, <i>x<\/i><sub>3<\/sub>, \u2026, <i>x<\/i><sub>n<\/sub>&nbsp;and a value of&nbsp;<strong>K<\/strong> (the number of clusters needed), we follow the procedure below:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Pick K points from the dataset as the initial centroids, either at random or simply the first K.<\/li><li>Compute the Euclidean distance from each point in the dataset to each of the K cluster centroids.<\/li><li>Assign each data point to its closest centroid, using the distances found in the previous step.<\/li><li>Compute the new centroid of each cluster as the average of the points assigned to it.<\/li><li>Repeat steps 2 to 4 for a fixed number of iterations or until the centroids no longer change.<\/li><\/ol>\n\n\n\n<h6 class=\"wp-block-heading\">Euclidean Distance between two points in space:<\/h6>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_99.png\"><img loading=\"lazy\" decoding=\"async\" width=\"282\" height=\"36\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_99.png\" alt=\"\" class=\"wp-image-774\"\/><\/a><\/figure><\/div>\n\n\n\n<p>If&nbsp;<b>p<\/b>&nbsp;=&nbsp;(<i>p<\/i><sub>1<\/sub>,&nbsp;<i>p<\/i><sub>2<\/sub>) and&nbsp;<strong>q<\/strong>&nbsp;=&nbsp;(<i>q<\/i><sub>1<\/sub>,&nbsp;<i>q<\/i><sub>2<\/sub>), then the distance is given by <i>d<\/i>(<b>p<\/b>,&nbsp;<strong>q<\/strong>)&nbsp;=&nbsp;&radic;((<i>p<\/i><sub>1<\/sub>&nbsp;&minus;&nbsp;<i>q<\/i><sub>1<\/sub>)<sup>2<\/sup>&nbsp;+&nbsp;(<i>p<\/i><sub>2<\/sub>&nbsp;&minus;&nbsp;<i>q<\/i><sub>2<\/sub>)<sup>2<\/sup>).<\/p>\n\n\n\n<p><strong>Implementation:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code EnlighterJSRAW\"><code>import math\n\ndef euclidean_distance(point1, point2):\n    # straight-line distance between two 2-D points\n    return 
math.sqrt((point1&#91;0]-point2&#91;0])**2 + (point1&#91;1]-point2&#91;1])**2)<\/code><\/pre>\n\n\n\n<h6 class=\"wp-block-heading\">Assigning each point to the nearest cluster:<\/h6>\n\n\n\n<p>If each cluster centroid is denoted by <em>c<sub>i<\/sub><\/em>,&nbsp;then each data point&nbsp;<em>x<\/em>&nbsp;is assigned to a cluster based on<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_100.png\"><img loading=\"lazy\" decoding=\"async\" width=\"167\" height=\"59\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_100.png\" alt=\"\" class=\"wp-image-775\"\/><\/a><\/figure><\/div>\n\n\n\n<p>Here<em> dist()<\/em> is the Euclidean distance.<\/p>\n\n\n\n<p><strong>Implementation:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code EnlighterJSRAW\"><code># compute the distance from each point to every centroid\nfor point in data:\n    distances = &#91;]\n    for index in self.centroids:\n        distances.append(self.euclidean_distance(point, self.centroids&#91;index]))\n\n    # once all distances are known, the point belongs to the nearest centroid\n    # ex: if distances are 2.03, 1.04, 5.6, 1.05 then the point belongs to cluster 1 (zero-based index)\n    cluster_index = distances.index(min(distances))\n    self.classes&#91;cluster_index].append(point)<\/code><\/pre>\n\n\n\n<h6 class=\"wp-block-heading\">Finding the new centroid from the clustered group of points:<\/h6>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_101.png\"><img loading=\"lazy\" decoding=\"async\" width=\"171\" height=\"83\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_101.png\" alt=\"\" class=\"wp-image-776\"\/><\/a><\/figure><\/div>\n\n\n\n<p>S<sub>i<\/sub> is the set of all points assigned to the <em>i<\/em>th 
cluster.<\/p>\n\n\n\n<p><strong>Implementation:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code EnlighterJSRAW\"><code># the new centroid of each cluster is the mean of the points assigned to it\nfor cluster_index in self.classes:\n    self.centroids&#91;cluster_index] = np.average(self.classes&#91;cluster_index], axis=0)<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">K-Means full implementation<\/h4>\n\n\n\n<p><script src=\"https:\/\/gist.github.com\/muthuspark\/1516df14143c0b8aecdd6b3a6f883428.js\"><\/script><\/p>\n\n\n\n<p>The program above generates a sample dataset with 4 clusters and then runs K-Means on it.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_103.png\"><img loading=\"lazy\" decoding=\"async\" width=\"378\" height=\"255\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_103.png\" alt=\"\" class=\"wp-image-778\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_103.png 378w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_103-300x202.png 300w\" sizes=\"auto, (max-width: 378px) 100vw, 378px\" \/><\/a><figcaption>Original data points<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_104.png\"><img loading=\"lazy\" decoding=\"async\" width=\"373\" height=\"252\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_104.png\" alt=\"\" class=\"wp-image-779\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_104.png 373w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_104-300x203.png 300w\" sizes=\"auto, (max-width: 373px) 100vw, 373px\" \/><\/a><figcaption>After K-Means<\/figcaption><\/figure><\/div>\n\n\n\n<h4 class=\"wp-block-heading\">K-Means implementation using Python 
sklearn:<\/h4>\n\n\n\n<pre class=\"wp-block-code EnlighterJSRAW\"><code>from sklearn.cluster import KMeans\nimport numpy as np\nimport matplotlib.pyplot as plt\n# make_blobs lives in sklearn.datasets; the old samples_generator module was removed\nfrom sklearn.datasets import make_blobs\n\n# generate a dummy dataset with K clusters\nK = 4\nX, y_true = make_blobs(n_samples=300, centers=K,\n                       cluster_std=0.60, random_state=0)\nk_means = KMeans(n_clusters=K)\nk_means.fit(X)\n\ncluster_centres = k_means.cluster_centers_\n\n# colour each point by the cluster it was assigned to\ny_kmeans = k_means.predict(X)\nplt.scatter(X&#91;:, 0], X&#91;:, 1], c=y_kmeans, s=50, cmap='viridis')\n\n# overlay the centroids as large translucent markers\nfor centroid in cluster_centres:\n    plt.scatter(centroid&#91;0], centroid&#91;1], s=300, c='black', alpha=0.5)\n\nplt.show()<\/code><\/pre>\n\n\n\n<p>Output from the above program:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105.png\"><img loading=\"lazy\" decoding=\"async\" width=\"365\" height=\"249\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105.png\" alt=\"\" class=\"wp-image-780\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105.png 365w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_105-300x205.png 300w\" sizes=\"auto, (max-width: 365px) 100vw, 365px\" \/><\/a><\/figure><\/div>\n\n\n\n<p>The output of the clustering we wrote from scratch is similar to the one we get by using the sklearn library.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Choosing initial centroids<\/h4>\n\n\n\n<p>In our implementation we chose the first 4 points as the initial cluster centroids, so the final centroids may differ slightly from run to run when the dataset is generated randomly. We can also use the <a href=\"https:\/\/en.wikipedia.org\/wiki\/K-means%2B%2B\">K-means++<\/a> method to choose our initial centroids.&nbsp;K-means++ was proposed in 2007 by Arthur and Vassilvitskii. 
This seeding method comes with a theoretical guarantee: the solution it finds is, in expectation, O(log k)-competitive with the optimal k-means solution. The sklearn KMeans class uses k-means++ as its default method for seeding the algorithm.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><a href=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_106.png\"><img loading=\"lazy\" decoding=\"async\" width=\"693\" height=\"45\" src=\"https:\/\/muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_106.png\" alt=\"\" class=\"wp-image-781\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_106.png 693w, http:\/\/write.muthu.co\/wp-content\/uploads\/2018\/07\/Snip20180707_106-300x19.png 300w\" sizes=\"auto, (max-width: 693px) 100vw, 693px\" \/><\/a><\/figure><\/div>\n","protected":false},"excerpt":{"rendered":"<p>K-Means is one of the simplest unsupervised clustering algorithms, used to cluster data into K clusters. The algorithm iteratively assigns each data point to one of the K clusters based on how near the point is to the cluster centroid. 
The result of K-Means algorithm is: K number of cluster [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[24,37],"tags":[46,49],"class_list":["post-769","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence","category-data-science","tag-artificial-intelligence","tag-data-science"],"_links":{"self":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/769","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/comments?post=769"}],"version-history":[{"count":2,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/769\/revisions"}],"predecessor-version":[{"id":1891,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/769\/revisions\/1891"}],"wp:attachment":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media?parent=769"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/categories?post=769"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/tags?post=769"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}