Data science – Muthukrishnan

May 15, 2021

Birthday Problem and Monte Carlo Simulation

If you board a plane carrying 100 or more passengers, there is a better than 99% chance that at least two of the passengers share a birthday. On a bus carrying 50 passengers, there is a 97% chance that at least two of them share a […]
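
As a rough check on figures like these, here is a minimal sketch. It assumes the usual formulation of the birthday problem (at least two people in a group share a birthday, 365 equally likely days, no leap years). The function names are my own for illustration, not taken from the post.

```python
import random


def exact_shared_birthday(n, days=365):
    """Exact probability that at least two of n people share a birthday."""
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th person must avoid the k birthdays already taken.
        p_all_distinct *= (days - k) / days
    return 1.0 - p_all_distinct


def simulate_shared_birthday(n, trials=20_000, days=365, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(days) for _ in range(n)]
        # A repeated value means at least one shared birthday in this trial.
        if len(set(birthdays)) < n:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (23, 50, 100):
        print(n, exact_shared_birthday(n), simulate_shared_birthday(n))
```

The simulated estimate should land close to the exact value, and gets closer as `trials` grows.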

January 19, 2021

Understanding Interquartile Range (IQR) and Outliers

When dealing with a large amount of data, it’s good practice to remove any outliers before further processing unless there is a good reason to keep them. In simple terms, outliers are data points that are unusually far away from the rest of the dataset. For the lazy, here is the code to find outliers […]
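
The post's own code is not shown in this excerpt. As a stand-in, here is a minimal sketch of the common 1.5 × IQR rule, using only the Python standard library. The function name and the 1.5 multiplier are assumptions on my part, and other quartile conventions give slightly different fences.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```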

November 13, 2018

K-fold cross validation

The greatest headache for any machine learning engineer is the problem of overfitting. The model we trained works perfectly on the training dataset, but when applied to a new dataset it fails miserably. This is because of overfitting, where our classifier learns the provided dataset accurately but fails when applied to new data. One good […]

September 17, 2018

Linear Regression using Gradient Descent Algorithm

Gradient descent is an optimization method used to find the minimum value of a function by iteratively updating the parameters of the function. Parameters refer to coefficients in Linear Regression and weights in Neural Networks. In a linear regression problem, we find a model that gives an approximate representation of our dataset. In the below […]
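
To make the iterative update concrete, here is a minimal sketch of gradient descent fitting a line y = m*x + c by minimising the mean squared error. The function name, learning rate and epoch count are illustrative assumptions, not values from the post.

```python
def fit_line(xs, ys, learning_rate=0.01, epochs=5000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        preds = [m * x + c for x in xs]
        # Partial derivatives of (1/n) * sum((y - pred)^2) w.r.t. m and c.
        grad_m = (-2.0 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
        grad_c = (-2.0 / n) * sum(y - p for y, p in zip(ys, preds))
        # Step against the gradient.
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


if __name__ == "__main__":
    # Points lying exactly on y = 2x + 1.
    print(fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9]))
```

A learning rate that is too large makes the updates diverge, and one that is too small needs many more epochs, so both usually need some tuning for a given dataset.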

August 28, 2018

Mathematics of Principal component analysis

Principal component analysis is a method used to reduce the number of dimensions in a dataset without losing much information. It’s used in many fields such as face recognition and image compression and is a common technique for finding patterns in data and also in the visualization of higher-dimensional data. PCA is all about geometrically […]

July 7, 2018

Understanding the Classification report through sklearn

A classification report is used to measure the quality of predictions from a classification algorithm. It details how many predictions are true and how many are false. More specifically, true positives, false positives, true negatives, and false negatives are used to calculate the metrics of a classification report, as shown below. The report is copied […]

July 7, 2018

K-Means on Iris Dataset

Read my previous post to understand how the K-Means algorithm works. In this post I will try to run K-Means on the Iris dataset to classify our 3 classes of flowers, Iris setosa, Iris versicolor and Iris virginica (our classes), using the flowers’ sepal length, sepal width, petal length and petal width (our features). Getting data: describe the data: Converting the class […]

July 7, 2018

Mathematics behind K-Mean Clustering algorithm

K-Means is one of the simplest unsupervised clustering algorithms, used to cluster our data into K clusters. The algorithm iteratively assigns each data point to one of the K clusters based on how near the point is to the cluster centroid. The result of the K-Means algorithm is: K number of cluster […]
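
The assign-then-update loop described above can be sketched in plain Python. This is a minimal illustration, assuming squared Euclidean distance and random initial centroids drawn from the data. The function names are hypothetical, and it is not the implementation from the post.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Return (centroids, labels) after running basic K-Means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(pts, k=2))
```

The result depends on the initial centroids, so on less clearly separated data it is common to run the algorithm several times with different seeds.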

June 30, 2018

Understanding Support vector Machines using Python

Support Vector Machines (SVMs) can be used for both classification and regression tasks, but they are mostly used in classification applications. Some real-world applications include face detection, handwriting detection, document categorisation, spam filtering, image classification and protein remote homology detection. For many researchers, SVM is the first choice for […]

June 20, 2018

Simple example of Polynomial regression using Python

Previously I wrote an article explaining the underlying maths behind polynomial regression. In this post I will use Python libraries to regress a simple dataset to see polynomial regression in action. If you want to fully understand the internals I recommend you read my previous post. Polynomial regression is a method of finding an nth […]