{"id":1689,"date":"2021-01-19T02:16:39","date_gmt":"2021-01-19T02:16:39","guid":{"rendered":"http:\/\/192.168.31.181\/muthu\/?p=1689"},"modified":"2021-01-19T02:16:39","modified_gmt":"2021-01-19T02:16:39","slug":"understanding-interquartile-range-iqr-and-outliers","status":"publish","type":"post","link":"http:\/\/write.muthu.co\/understanding-interquartile-range-iqr-and-outliers\/","title":{"rendered":"Understanding Interquartile Range (IQR) and Outliers"},"content":{"rendered":"\n<p>When dealing with a large amount of data, it&#8217;s good practice to remove any outliers before further processing unless there is a good reason to keep them. Outliers, in simple words, are data points that are unusually far away from the rest of the dataset.<\/p>\n\n\n\n<p>For the lazy, here is the code to find outliers using IQR in a 1D array.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\n# pass a 1D NumPy array; returns the indices of the outliers\ndef outliers_iqr(ys):\n    # Q1 is the 25th percentile, Q3 is the 75th percentile\n    quartile_1, quartile_3 = np.percentile(ys, &#91;25, 75])\n    iqr = quartile_3 - quartile_1\n    lower_bound = quartile_1 - (iqr * 1.5)\n    upper_bound = quartile_3 + (iqr * 1.5)\n    return np.where((ys &gt; upper_bound) | (ys &lt; lower_bound))\n\n# pass a 1D NumPy array; returns a copy without the outliers\ndef remove_outliers(array):\n    index_outliers = outliers_iqr(array)\n    outliers_removed = np.delete(array, index_outliers)\n    return outliers_removed<\/code><\/pre>\n\n\n\n<p>Consider a hypothetical situation where a teacher is given the task of checking whether the BMI of the students in her class is within a healthy range. After collecting the height and weight data of all her students, it&#8217;s important to check for any outliers that may affect the mean calculation. An outlier could also be an experimental error, and not removing it may drag the mean BMI to a value that does not reflect the actual health of her students.<\/p>\n\n\n\n<p>The values removed from the dataset are what we call outliers. 
There are many ways to remove outliers; one of them is the IQR (interquartile range) method. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), so it spans the middle 50% of the values. The best way to visualize the IQR is through a box plot. Take a look at the box plot below to get an understanding of the IQR.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"637\" height=\"610\" src=\"http:\/\/35.93.160.39\/wp-content\/uploads\/2021\/01\/mk.png\" alt=\"\" class=\"wp-image-1693\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/mk.png 637w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/mk-300x287.png 300w\" sizes=\"auto, (max-width: 637px) 100vw, 637px\" \/><\/figure>\n\n\n\n<p>The diagram above makes it clear which data points count as outliers. The code for it looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def outliers_iqr(ys):\n    # Q1 is the 25th percentile, Q3 is the 75th percentile\n    quartile_1, quartile_3 = np.percentile(ys, &#91;25, 75])\n    # IQR is Q3 - Q1\n    iqr = quartile_3 - quartile_1\n    lower_bound = quartile_1 - (iqr * 1.5)\n    upper_bound = quartile_3 + (iqr * 1.5)\n    # Anything below the lower bound or above the upper bound is\n    # an outlier\n    return np.where((ys &gt; upper_bound) | (ys &lt; lower_bound))<\/code><\/pre>\n\n\n\n<p>As an example, let&#8217;s look at the same hypothetical BMI use case we discussed previously. I have a synthetic dataset containing the heights and weights of about 30 students. Our goal is to understand how outliers in the dataset affect the mean. 
<br><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Load and Visualize Data<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\nimport pandas as pd\n\n\ndata = pd.read_csv('https:\/\/muthu.s3-us-west-2.amazonaws.com\/dataset\/weight-height.csv')\n\n# converting height from inches to meters\ndata&#91;'Height'] = data&#91;'Height'] * 0.0254\n# converting weight from pounds to kilograms\ndata&#91;'Weight'] = data&#91;'Weight'] * 0.453592\n# BMI = weight (kg) \/ height (m)^2\ndata&#91;'BMI'] = data&#91;'Weight']\/data&#91;'Height']**2\n\nprint(data)\ndata.boxplot('BMI', figsize=(10,10))<\/code><\/pre>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/35.93.160.39\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.14.49-AM-602x1024.png\" alt=\"\" class=\"wp-image-1696\" width=\"358\" height=\"608\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.14.49-AM-602x1024.png 602w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.14.49-AM-176x300.png 176w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.14.49-AM.png 608w\" sizes=\"auto, (max-width: 358px) 100vw, 358px\" \/><\/figure><\/div>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"592\" height=\"575\" src=\"http:\/\/35.93.160.39\/wp-content\/uploads\/2021\/01\/box-1.png\" alt=\"\" class=\"wp-image-1697\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/box-1.png 592w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/box-1-300x291.png 300w\" sizes=\"auto, (max-width: 592px) 100vw, 592px\" \/><figcaption>Plot showing an outlier<\/figcaption><\/figure>\n\n\n\n<p>As you can see in the box plot above, there is one record that lies far away from the rest of the data. This could possibly be an experimental error. 
<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Find the Mean with and without Outliers<\/h2>\n\n\n\n<pre class=\"wp-block-code\"><code># mean on raw data with outliers\nh1 = np.mean(data&#91;'Height'].to_numpy())\nw1 = np.mean(data&#91;'Weight'].to_numpy())\nb1 = np.mean(data&#91;'BMI'].to_numpy())<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>def outliers_iqr(ys):\n    quartile_1, quartile_3 = np.percentile(ys, &#91;25, 75])\n    iqr = quartile_3 - quartile_1\n    lower_bound = quartile_1 - (iqr * 1.5)\n    upper_bound = quartile_3 + (iqr * 1.5)\n    return np.where((ys &gt; upper_bound) | (ys &lt; lower_bound))\n\n# pass a 1D NumPy array\ndef remove_outliers(array):\n    index_outliers = outliers_iqr(array)\n    outliers_removed = np.delete(array, index_outliers)\n    return outliers_removed\n\n# removing outliers and finding the mean\nh2 = np.mean(remove_outliers(data&#91;'Height'].to_numpy()))\nw2 = np.mean(remove_outliers(data&#91;'Weight'].to_numpy()))\nb2 = np.mean(remove_outliers(data&#91;'BMI'].to_numpy()))<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"230\" src=\"http:\/\/35.93.160.39\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.32.44-AM-1024x230.png\" alt=\"\" class=\"wp-image-1701\" srcset=\"http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.32.44-AM-1024x230.png 1024w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.32.44-AM-300x67.png 300w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.32.44-AM-768x172.png 768w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.32.44-AM-780x175.png 780w, http:\/\/write.muthu.co\/wp-content\/uploads\/2021\/01\/Screenshot-2021-01-19-at-7.32.44-AM.png 1168w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>As I mentioned at the beginning, keeping or removing outliers 
is a decision driven by domain knowledge. You must use your knowledge to decide which values should be discarded from your dataset.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When dealing with a large amount of data, it&#8217;s good practice to remove any outliers before further processing unless there is a good reason to keep them. Outliers, in simple words, are data points that are unusually far away from the rest of the dataset. For the lazy, here is the code to find outliers [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1701,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[37,31],"tags":[],"class_list":["post-1689","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-statistics"],"_links":{"self":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/1689","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/comments?post=1689"}],"version-history":[{"count":7,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/1689\/revisions"}],"predecessor-version":[{"id":1706,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/posts\/1689\/revisions\/1706"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media\/1701"}],"wp:attachment":[{"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/media?parent=1689"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/categories?post=1689"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/write.muthu.co\/wp-json\/wp\/v2\/tags?post=1689"}],"curies":[{"name":"wp","href"
:"https:\/\/api.w.org\/{rel}","templated":true}]}}