Category Archives: data science

They all look the same. How do I classify ? :/

Today we will be dabbling in the world of classification analysis. Before going any further, you need to know what classification analysis is and what it is used for. If you are already familiar with what classification and regression mean, skip ahead to the subsection where I describe the data we will be using today.



Machine learning nowadays is usually split into two bags: supervised learning and unsupervised learning.

We will be discussing supervised learning in today's post. A brief description of the two, and of the difference between them, is as follows.

Say you are given a dataset containing the height, weight and hair length of volunteers, and you need to determine whether a particular volunteer is male or female. Basically, you want to determine the gender.

In supervised learning, the height, weight and hair length values of each volunteer are labelled with that volunteer's gender, and this data can be used to work out a formula that determines gender. This formula, or model, can later be used to predict whether a particular person is male or female.

In unsupervised learning, on the other hand, we do not have labels telling us the gender of each volunteer. What we can work out is that there are two groups whose members show similar traits and hence should belong to the same category. We can give those groups labels such as FIRST or SECOND.
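To make the difference concrete, here is a minimal sketch using scikit-learn. The eight volunteers and their measurements are made up for illustration, and the choice of a nearest-neighbour classifier and k-means clustering is mine, not something this post depends on.

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Made-up measurements: [height in cm, weight in kg, hair length in cm].
volunteers = [
    [180, 82, 5], [175, 78, 8], [183, 90, 4], [178, 85, 6],
    [162, 55, 40], [158, 52, 35], [165, 60, 45], [160, 57, 38],
]
genders = ["male"] * 4 + ["female"] * 4   # labels, used only by the supervised model

# Supervised: learn from the measurements AND their labels.
supervised = KNeighborsClassifier(n_neighbors=3)
supervised.fit(volunteers, genders)
prediction = supervised.predict([[170, 62, 30]])

# Unsupervised: only the measurements; the model just finds two groups.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = unsupervised.fit_predict(volunteers)
group_names = ["FIRST" if g == groups[0] else "SECOND" for g in groups]

print(prediction)
print(group_names)
```

The supervised model answers with one of the gender labels it was shown, while the clustering can only say FIRST or SECOND, because nobody told it what the groups mean.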

Supervised machine learning is broadly divided into two tasks, the first being classification analysis and the second being regression analysis. As you might have intuitively figured out, classification refers to distinguishing between various objects or entities by labelling them with a name or some unique ID, so that similar entities are grouped together for easy recognition later.

Regression is more about predicting continuous values. Here we are not trying to classify anything, but rather trying to predict future values such as stock prices or commodity prices. We won't be talking about this in depth in this post, but we will in the next one.



There are various applications of machine learning for classification, such as distinguishing between a plant and an animal in a picture using image processing techniques. In the financial industry it is essential to assess whether a particular investment is good or bad. The banking industry uses classification to separate people who are likely to default on loans from people who will pay their debt on time. This is one aspect of machine learning which is practically employed in our world today.

Today we will be dealing with the classification of iris plants into 3 different categories.



The Iris data set is available on the UCI Machine Learning Repository website, where you can find many different datasets to work with while learning about machine learning.

The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
The data set and information about it can be found at this link: Iris Dataset.

These are the three categories of iris plant that we will try to classify.




Notice how there is only a very subtle difference between the different categories. This is a limitation of human vision that can be overcome using machine learning.


We will be dealing with just basic visualization and a very basic classifier, using Python to distinguish between Iris setosa, Iris versicolour and Iris virginica.

For this, the following libraries need to be installed: matplotlib, scikit-learn, numpy and pandas.

If you are not able to install the packages properly, check out this video to get you through to the end.

You can also download Windows binaries from this website: Python Extensions. If you face any difficulty, please drop a comment below the post.


First we import pyplot from matplotlib, and then the Iris data set loader from sklearn (the scikit-learn library, or module, or whatever you might wanna call it). We also import the numpy and pandas modules.

from matplotlib import pyplot as plt
from sklearn.datasets import  load_iris
import numpy as np 
import pandas as pd

Now that all the modules have been imported, we will load the data into a variable called loaded.

loaded = load_iris()
feature_names = loaded["feature_names"]

Now we create a pandas DataFrame for easy manipulation later, while plotting various exploratory graphs.


Now we join the two data frames by adding another column to features and filling it with the values of labels.


load_iris is a function in sklearn.datasets. It returns a dictionary-like object, which we stored in loaded.
To access the different pieces of the data set, we look them up on the loaded object.
We store the measurements in a variable called features, and the feature names in the variable feature_names. labels are the categories to be predicted, i.e. setosa, versicolour and virginica.
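Put together, the loading, DataFrame and join steps described above could look something like the sketch below. The keys "data", "target" and "target_names" are real keys of the object that load_iris returns, but the column name "labels" and the exact way the two frames are joined are my assumptions about what those steps looked like.

```python
import pandas as pd
from sklearn.datasets import load_iris

loaded = load_iris()                      # dictionary-like Bunch object
feature_names = loaded["feature_names"]   # the four column names
class_names = loaded["target_names"]      # setosa, versicolor, virginica

# One DataFrame for the measurements, one for the class labels (0, 1, 2).
features = pd.DataFrame(loaded["data"], columns=feature_names)
labels = pd.DataFrame(loaded["target"], columns=["labels"])

# Join the two by adding the labels as another column of features.
features["labels"] = labels["labels"]

print(features.shape)
print(features.head())
```

After the join, every row holds the four measurements of one plant plus the number of the class it belongs to.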


Now for basic plotting we use plt, our alias for the pyplot module.

for i,color in zip(range(3),"rbg"):


What this code snippet does is plot the sepal length vs sepal width values of the different labels, with the color also governed by the label. This helps us see which attributes can be used to separate the categories of iris plant. One of the plots obtained is shown here. You don't really need to understand the code, because it is just for exploratory purposes. If you want an in-depth explanation, leave a request in the comments section. I'll be happy to answer any queries.
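Here is a minimal sketch of what such a plotting loop might look like. It works on the plain NumPy arrays rather than the DataFrame, and the variable names x_col, y_col and mask are mine. Change x_col and y_col to plot a different pair of attributes, for example 0 and 1 for sepal length vs sepal width.

```python
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

loaded = load_iris()
features = loaded["data"]            # NumPy array of shape (150, 4)
labels = loaded["target"]            # class ids 0, 1, 2
feature_names = loaded["feature_names"]

x_col, y_col = 0, 2                  # sepal length vs petal length

fig, ax = plt.subplots()
for i, color in zip(range(3), "rbg"):
    mask = labels == i               # rows belonging to class i
    ax.scatter(features[mask, x_col], features[mask, y_col],
               c=color, label=loaded["target_names"][i])
ax.set_xlabel(feature_names[x_col])
ax.set_ylabel(feature_names[y_col])
ax.legend()
```

Call plt.show() afterwards to open the plot window. With "rbg" as the color string, setosa comes out red, versicolor blue and virginica green.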

The X-axis is the sepal length and the Y-axis denotes the petal length.

As can be seen from the plot of sepal length vs petal length, there is a marked difference between Iris setosa and the other two categories of iris plant, and this can be used to form a model based on the attributes sepal length and petal length. Just by looking at the scatter plot you can form a rule for yourself: if the sepal length is less than 6.0 and the petal length is less than 2.5, the plant can be classified as setosa, which is denoted by red dots in the plot. The red lines shown below are called decision boundaries.
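The hand-made rule can be written down and checked against the data in a few lines. This is only a sketch: the function name looks_like_setosa is made up, and the thresholds 6.0 and 2.5 are simply the ones read off the plot above.

```python
from sklearn.datasets import load_iris

loaded = load_iris()
features, labels = loaded["data"], loaded["target"]

def looks_like_setosa(sepal_length, petal_length):
    """Hand-made rule read off the scatter plot (both values in cm)."""
    return bool(sepal_length < 6.0 and petal_length < 2.5)

# Column 0 is sepal length, column 2 is petal length.
rule_says_setosa = [looks_like_setosa(row[0], row[2]) for row in features]
really_setosa = [bool(label == 0) for label in labels]

# Count the plants for which the rule and the true label agree.
agreement = sum(r == t for r, t in zip(rule_says_setosa, really_setosa))
print(agreement, "of", len(labels), "plants match the rule")
```

The rule only answers setosa or not setosa. It has nothing to say about telling versicolor and virginica apart.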

Now this is a very crude way of carrying out classification. But what if we have multiple properties or features that need to be used? What if we want to tell all three classes apart, rather than just setosa and non-setosa?

In such a situation we can use the SVM classifier. SVM stands for support vector machine. What a support vector machine does is create decision boundaries for you automatically, and if two categories cannot be separated by a straight line, like versicolor and virginica in our case, it transforms their features into alternate features using a kernel function. I know kernel function is a heavy word, but all it means is that it performs certain operations on these features so that the new features are linearly separable by straight decision boundaries. In a way, the transformation takes these features into alternate dimensions. These decision boundaries might not be straight when brought back to their original dimensions. This video shows SVM in action.
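The idea of transforming features until a straight boundary works can be shown with a toy example in plain Python. Note that this is an explicit, hand-picked transformation (squaring x) on made-up numbers. A real SVM kernel gets the same effect implicitly and does not work like this internally, so treat it as an illustration of the idea only.

```python
# Toy 1-D data: class 1 sits on both sides of class 0, so no single
# threshold on x can separate the two classes.
xs = [-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]
ys = [1, 1, 1, 0, 0, 0, 1, 1, 1]

def separable_by_threshold(values, classes):
    """True if some cut point puts each class entirely on one side."""
    ordered = [c for _, c in sorted(zip(values, classes))]
    changes = sum(a != b for a, b in zip(ordered, ordered[1:]))
    return changes <= 1

# Transform each x into a new feature: x squared.
squared = [x * x for x in xs]

before = separable_by_threshold(xs, ys)
after = separable_by_threshold(squared, ys)
print(before, after)
```

On the original axis the classes are interleaved, but on the squared axis a single cut between 0.25 and 4 separates them. Brought back to the original axis, that one cut becomes two boundaries, which is the sense in which boundaries stop being straight in the original dimensions.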

Another beautiful example of decision boundaries can be seen on a zoo map, where different species are separated by different paths.




from sklearn import svm

What this code does is initialise an object named clf, which is an instance of a classifier class from the svm module. The fit method of this class is used to fit a model to the features and the labels. The model fitting is done behind the scenes, and what you end up with is a clf object that can be used to classify an iris plant, given the same kind of features that were provided while fitting the model.
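A minimal sketch of the fitting step, assuming the classifier class is svm.SVC with its default settings (the post does not say which class or parameters were used):

```python
from sklearn import svm
from sklearn.datasets import load_iris

loaded = load_iris()
features, labels = loaded["data"], loaded["target"]

clf = svm.SVC()              # default settings use an RBF kernel
clf.fit(features, labels)    # the model fitting happens in here

# Fraction of the training plants that the fitted model labels correctly.
training_accuracy = clf.score(features, labels)
print(training_accuracy)
```

Scoring on the same plants the model was fitted on is an optimistic check. It only tells you the fit went through, not how the model will do on plants it has never seen.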

Now let's try to predict using the clf classifier object that we created. For that we need the features of the iris plants to be classified, and we generate our own feature data by looking at the feature data already used.


Class 0 is the setosa class. Let's create our own feature set as a variable and give it the name unknown. unknown holds the feature sets of two different plants, which I made by tweaking the values of the feature set we have on hand.


Now it's time to predict. We use the predict method of the clf object to predict.



array([0, 0])

The output concurs with what we expected: it is an array of size two, with each element denoting the setosa class, which is enumerated as 0. It is that easy to get predictions out of the classifier we built.
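The prediction step can be sketched end to end as follows. The two plants in unknown are values I made up to resemble setosa rows, not the ones used for the output shown above, and svm.SVC with default settings is again an assumption.

```python
from sklearn import svm
from sklearn.datasets import load_iris

loaded = load_iris()
clf = svm.SVC()
clf.fit(loaded["data"], loaded["target"])

# Two made-up plants: sepal length, sepal width, petal length, petal width (cm).
unknown = [
    [5.0, 3.4, 1.5, 0.2],
    [4.8, 3.1, 1.4, 0.3],
]

predicted = clf.predict(unknown)
print(predicted)
print([loaded["target_names"][i] for i in predicted])
```

The second print turns the class numbers back into names, which is handy once you forget which number stands for which plant.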

You are now equipped with all the necessary knowledge to get started with basic prediction. Do check out the different types of classifier and how they work.

And here is the whole code in one single snippet:

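from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from sklearn import svm

# NOTE: several lines of this listing were lost. Every line marked
# "assumed" is a reconstruction based on the walkthrough above.
loaded = load_iris()
feature_names = loaded["feature_names"]

"""now we create a pandas dataframe
for easy manipulation later while plotting various exploratory graphs."""
features = pd.DataFrame(loaded["data"], columns=feature_names)  # assumed
labels = pd.DataFrame(loaded["target"], columns=["labels"])  # assumed

"""now we join the two data frames by adding another column in features
and adding values for labels in it."""
features["labels"] = labels["labels"]  # assumed

for i, color in zip(range(3), "rbg"):
    subset = features[features["labels"] == i]  # assumed
    # assumed: sepal length on the x-axis, petal length on the y-axis
    plt.scatter(subset[feature_names[0]], subset[feature_names[2]], c=color)
plt.show()  # assumed

# modeling the support vector machine
clf = svm.SVC()  # assumed
clf.fit(features[feature_names].values, features["labels"].values)  # assumed

# now its time for prediction
unknown = [[5.0, 3.4, 1.5, 0.2], [4.8, 3.1, 1.4, 0.3]]  # assumed values

print(clf.predict(unknown))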

I hope you liked this post, even if you are really angry about how long it was. So here's a perfunctory GIF to demonstrate how exhausting writing a blog is.



It's you again, DATA STRUCTURES!

Hey there! OK, let's start by saying this.

I did try to make a video about list comprehensions this time around, but I goofed up, and I really need to improve my video editing skills before I post any of it online, so it might be a while before I put up a vlog. I am going to continue where I left off: lists and working with lists.

So far I have covered how to initialise and work with a list. Now for some basic operations that a list comes with. These operations are called methods. Getting back to the example I gave you guys before, assume that you are going to a market place, or Lidl, or your local grocery store. So you make a list of the things you want to buy when you get there, something similar to what we have here.

So you initialise a list named shopping_list

shopping_list = ["milk", "butter", "Cheddar cheese", "plain yogurt"]

and now you want to carry out the following operations on your list, say

  • Add a product to the list
  • Add multiple products to the list
  • Remove a particular product from the list
  • or maybe, for some god-knows-why reason, you want the index of a particular product. You know, some people just have these weird habits, or maybe you have OCD. I don't know, it's your problem. Deal with it!

Let's address these issues. Not the OCD one. You still gotta deal with that yourself.


ADDING a (single) PRODUCT to the list

OK, so what you do is use a simple method called append, with the following syntax

shopping_list.append("A pack of beer")

What this does is add the item "A pack of beer" to the end of the list. So when you print out your shopping_list using the print(shopping_list) command, what you get is this.

['milk', 'butter', 'Cheddar cheese', 'plain yogurt', 'A pack of beer']

Someone is having a party tonight. What if you want to add multiple items?


ADDING multiple products to the shopping_list .

Now if you want to add multiple items, say haribos and ice cream, salami and so on, you can use the method called extend.

So shopping_list.extend(["ice cream", "salami", "haribos"]) does the job for you. It adds these items to the end of the existing ones. Remember that the parameter you pass to extend has to be an iterable; for now, assume that means the same as a list.

Now your list looks like this.

['milk', 'butter', 'Cheddar cheese', 'plain yogurt', 'A pack of beer', 'ice cream', 'salami', 'haribos']
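It is easy to mix up append and extend, so here is a small sketch of what happens when you hand each of them the wrong kind of thing. The lists nested and letters are just made-up names for the demonstration.

```python
shopping_list = ["milk", "butter"]

shopping_list.append("A pack of beer")           # one item, added as-is
shopping_list.extend(["ice cream", "salami"])    # each item of the list is added

nested = ["milk"]
nested.append(["ice cream", "salami"])           # the whole list becomes ONE item

letters = ["milk"]
letters.extend("tea")                            # a string is an iterable of characters

print(shopping_list)
print(nested)
print(letters)
```

So append always grows the list by exactly one element, whatever you give it, while extend walks through the iterable and adds its elements one at a time.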


REMOVING a Product from the list.

You know those days when you dream of having a lavish life, and then you peek into your wallet


and you are like ...

Yeah, you gotta remove those beer cans. Sorry, man.

So how do you go about it? You use the method called remove, which takes as its argument the particular element you want to remove. So the exact value or item name is needed:

shopping_list.remove("A pack of beer")

This results in your shopping list looking like this

['milk', 'butter', 'Cheddar cheese', 'plain yogurt', 'ice cream', 'salami', 'haribos']

It had to be done, bro!!! Sorry!

Now there is another method you can call to remove elements, either from the end of the list or, if you know the index, from that position.

shopping_list.pop() removes the last element of the list. So there go your haribos.

And if you know the index, say you know you wrote down butter as the second element of the list, you use

shopping_list.pop(1) (second element = index 1, remember), which removes your butter from the list … that cholesterol ain't gonna go down by itself.
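A short sketch of how remove and pop differ, using a cut-down shopping list. The item "caviar" is made up to show what happens when you try to remove something that was never on the list.

```python
shopping_list = ["milk", "butter", "A pack of beer", "haribos"]

shopping_list.remove("A pack of beer")   # remove by value, gives nothing back
last_item = shopping_list.pop()          # remove the last element and return it
second_item = shopping_list.pop(1)       # remove index 1 and return it

try:
    shopping_list.remove("caviar")       # not on the list
except ValueError:
    missing = True
else:
    missing = False

print(last_item, second_item, shopping_list, missing)
```

Use pop when you still need the item you took out, and remove when you only know its value and just want it gone.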


FINDING INDEX of a particular product.

Now for the weird part.

If you want to know at what index a particular item appears, you can use

shopping_list.index("salami"), which outputs the index at which salami appears in the list.
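Here is index at work on the list as it stands after all the removals above. The helper find_index is hypothetical, just one way to avoid the ValueError that index raises when the item is not in the list.

```python
shopping_list = ["milk", "Cheddar cheese", "plain yogurt", "ice cream", "salami"]

position = shopping_list.index("salami")

def find_index(items, wanted):
    """Hypothetical helper: the index of wanted, or None if it is absent."""
    return items.index(wanted) if wanted in items else None

print(position)
print(find_index(shopping_list, "butter"))
```

Indexes start at 0, so the fifth item on the list sits at index 4, and butter comes back as None because it was popped earlier.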

by now you must be like SHUT UP ALREADY !!


So I'll drop the chalk here and suggest you go through this particular Codecademy exercise to build your basics and get a complete grasp of the subject. I sincerely request you guys to go through this PARTICULAR TUTORIAL (this is hyperlinked), where they teach you how to traverse a list and access each element. I know I haven't really covered much about machine learning, but this is important, as it is the base on which you build, so check that link out. In the next post I'll start with pandas and something called CSV files, so keep up.





Knock Knock ? Who’s there ? Data Structures.. :|


As promised, today I'm dropping my post about the data structures available in Python.

For those of you who are not familiar with the concept of data structures: basically, a data structure defines the way our data is stored so that it can be accessed efficiently.

AND good machine learning happens only when we have good quality data to create our features from. If you don't know what a feature is, stick with this blog; I will be posting about it later.

As for the data scientists out here: remember, most data scientists spend 70-80% of their time working to sort out data.

Now the reality is that most of the data generated nowadays is unstructured. This is similar to working at a store which looks something like this.

Lovely little shrine for bad quality  unstructured data

Would you be ready to assist someone in finding a product of their choice in such a mess? HELLL NOOO!!!


So what you do is organise your data in a very structured and thought-out manner, for easy access in the future, and end up with something like this.

Praise the lord!!!

Someone's job just got a lot easier.
As a machine learner, or someone who wants to be a DATA SCIENTIST, YOU GOT TO STRUCTURE YOUR DATA. Now what kind of structure you use depends completely on you and how you want to access the data. If speed is the need or space is the problem, you might wanna try a different structure, similar to a store which might try a circular setup like the one shown here.
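As a small taste of how the structure changes the way you get at your data, here is the same tiny store inventory kept as a list and as a dictionary. The products, prices and function names are all made up for the sketch.

```python
# The same small store inventory, stored two different ways.
inventory_list = [("milk", 0.99), ("butter", 1.79), ("salami", 2.49)]
inventory_dict = {"milk": 0.99, "butter": 1.79, "salami": 2.49}

def price_from_list(items, product):
    """Walk the shelves one by one until the product turns up."""
    for name, price in items:
        if name == product:
            return price
    return None

def price_from_dict(items, product):
    """Go straight to the labelled shelf."""
    return items.get(product)

print(price_from_list(inventory_list, "salami"))
print(price_from_dict(inventory_dict, "salami"))
```

Both give the same answer, but the list version has to look at every item before the one you want, while the dictionary jumps straight to it by name. With three products nobody cares; with three million you will.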


What are our options? Well, when you talk about storing data in Python, the first thing that comes to mind is the list. I thought of creating a vlog about lists, but I'm kinda short on time, so I will put up a vlog about list operations, or comprehensions as we call them, later. Till then, check out this introductory video by KHAN ACADEMY. It explains the nuances of the concept of a list pretty concisely.

Now that you are familiar with the concept of a list, I will keep this post short. In the next post I will be covering list comprehensions; that'll be tomorrow or today, depending on when I get the time. After learning about list comprehensions you should be able to script most algorithmic or data munging (cleaning) requirements. So please keep up 🙂. This will be followed by a post about dictionaries and sets. I've never really used sets for machine learning, but there's no harm in knowing about them.

And sign UP if you don't want to miss out on the next post.







To start off

First off …. Thank you very much for paying a visit to this page.

Alright, so we will be starting off with the Python for machine learning blog today. Here's what's in store for you visitors out here.

Through this blog I'll initially be posting about the general Python language operations that need to be known beforehand, so that the coding can be done easily without wondering about the syntax of the language.

As we all know, the effectiveness of a particular machine learning method or algorithm is fundamentally determined by the kind of features and data being fed to it, so it is essential to know how to handle this data. Hence a general-purpose knowledge of the language, in our case Python, is essential.

As we gradually get accustomed to the language, we will start off with concepts in machine learning. I will assume that you have at least some knowledge of machine learning and its uses. Later on we may introduce some concepts like Map-Reduce and other added bonus features.

This blog is specially for those who have little or minimal knowledge of computer science …