Today we will be dabbling in the world of classification analysis. Before we go any further, you need to know what classification analysis is and what it is used for. If you are already familiar with what classification and regression mean, skip ahead to the subsection where I introduce the data we will be using today.
INTRODUCTION TO MACHINE LEARNING
Machine learning nowadays is broadly divided into two buckets: supervised learning and unsupervised learning.
We will be discussing supervised learning in today's post. A brief description of the two, and the difference between them, is as follows.
Say you are given a dataset containing the height, weight, and hair length of volunteers, and you need to determine whether a particular volunteer is male or female; basically, you want to determine the gender.
In supervised learning, the height, weight, and hair length values of each volunteer are labelled with that volunteer's gender, and this data can be used to build a formula to determine gender. This formula, or model, can later be used to predict whether a particular person is male or female.
In unsupervised learning, by contrast, we have no labels to group the volunteers by gender. What we can generalise is that there are two groups that show similar traits and hence should belong to the same category; we can give them labels such as FIRST or SECOND.
Supervised machine learning is broadly divided into two tasks: classification analysis and regression analysis. As you might have intuitively figured out, classification refers to distinguishing between various objects or entities by labelling them with a name or a unique ID, so that similar entities are grouped together for easy recognition later.
Regression is about predicting continuous values. Here we are not trying to classify anything, but rather to predict future values, say stock prices or commodity prices. We won't be talking in depth about regression in this post, but we will in the next.
There are various applications of machine learning for classification: distinguishing between a plant and an animal in a picture using image processing techniques, for example. In the financial industry it is essential to assess whether a particular investment is good or bad, and banks use classification to separate people who are likely to default on loans from people who will pay their debt on time. These are just some of the ways machine learning is practically employed in our world today.
Today we will be dealing with classifying iris plants into 3 different categories.
The Iris data set is available on the UCI Machine Learning Repository website, where you can find many other datasets to work with while learning about machine learning.
The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
The Iris data set and its documentation can be found at this link: Iris Dataset.
These are the three categories of iris plant that we will try to classify:
Notice how there are only very subtle differences between the categories. This is a limitation of human vision that can be overcome using machine learning.
GETTING STARTED IN PYTHON
We will be dealing with basic visualization and a very basic classifier in Python to distinguish between Iris setosa, Iris versicolour, and Iris virginica.
The following libraries need to be installed.
If you are not able to install the packages properly, check out this video to get you through the final stages.
You can also download Windows binaries from this website: Python Extensions. If you face any difficulty, please drop a comment below the blog.
CODE SNIPPETS AND EXPLANATION
First we import pyplot from matplotlib, then import the iris data set from the sklearn (scikit-learn) library, or module, whatever you might want to call it. We also import the numpy and pandas modules.
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
Now that all modules have been imported, we will load the data into a variable called loaded.
loaded = load_iris()
features = loaded["data"]
feature_names = loaded["feature_names"]
labels = loaded["target"]
Now we create a pandas DataFrame for easy manipulation later, while plotting various exploratory graphs.
Next we join the two by adding another column to the features DataFrame and filling it with the values of labels.
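A minimal sketch of those two steps might look like the following (the column name label is my own choice, not something the dataset prescribes):

```python
import pandas as pd
from sklearn.datasets import load_iris

loaded = load_iris()
# wrap the feature matrix in a DataFrame, using the dataset's own column names
df = pd.DataFrame(loaded["data"], columns=loaded["feature_names"])
# join the labels by adding them as an extra column
df["label"] = loaded["target"]
print(df.head())
```

This gives one table with 150 rows and 5 columns, so the measurements and the class of each sample travel together.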
load_iris is a function in sklearn.datasets; it returns a Bunch object, which behaves like a dictionary.
In order to access the attributes of that object, we use the variable loaded.
We assign the features of the data to a variable called features, and the feature names to feature_names. labels holds the categories to be predicted, i.e. setosa, versicolour, and virginica.
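To get a feel for what the loaded object exposes, you can print a few of its attributes directly, a quick sketch:

```python
from sklearn.datasets import load_iris

loaded = load_iris()
print(loaded["feature_names"])  # the four measured attributes, in cm
print(loaded["target_names"])   # the three iris categories
print(loaded["data"].shape)     # (150, 4): 150 samples, 4 features each
```

The class labels in loaded["target"] are just the numbers 0, 1, and 2; target_names tells you which number maps to which plant.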
Now, for basic plotting, we use the pyplot module, which we imported as plt.
for i, color in zip(range(3), "rbg"):
    plt.scatter(features[labels == i, 0], features[labels == i, 2], c=color, marker="o")
plt.show()
What this code snippet does is plot the sepal length vs petal length values of the different labels, with the colour also governed by the label. This helps us see which attributes can be used to separate the categories of the iris plant. One of the plots obtained is shown here. You don't really need to understand the code, as it is just for exploratory purposes. If you want an in-depth explanation, leave a request in the comments section; I'll be glad to answer any queries.
The X-axis is the sepal length and the Y-axis is the petal length.
As can be seen from the plot of sepal length vs petal length, there is a marked difference between Iris setosa and the other two categories, and this can be used to form a model based on the attributes sepal length and petal length. Just by looking at the scatter plot you can form a rule for yourself: if the sepal length is less than 6.0 and the petal length is less than 2.5, the plant can be classified as setosa, which is denoted by red dots in the plot. The red lines are called decision boundaries, as shown below.
Now this is a very crude way of carrying out classification. What if we have multiple properties or features that need to be used? What if we want to separate all three classes, rather than just setosa and non-setosa?
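That eyeballed rule can be written down as a tiny threshold classifier. The 6.0 and 2.5 cut-offs below are simply the values read off the plot, not tuned parameters:

```python
from sklearn.datasets import load_iris

loaded = load_iris()
features = loaded["data"]
labels = loaded["target"]

def is_setosa(sample):
    """Crude rule read off the scatter plot: small sepal and petal lengths."""
    sepal_length, petal_length = sample[0], sample[2]
    return sepal_length < 6.0 and petal_length < 2.5

# check the hand-made rule against every labelled sample (setosa is class 0)
correct = sum(is_setosa(s) == (l == 0) for s, l in zip(features, labels))
print(f"{correct} of {len(features)} samples classified correctly")
```

Because setosa sits so far from the other two classes on these two axes, even this crude rule separates it cleanly; it only works for the setosa vs non-setosa split, though.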
In such a situation we can use the SVM classifier. SVM stands for support vector machine. What a support vector machine does is create decision boundaries for you automatically, and if two categories cannot be separated, like versicolour and virginica in our case, it transforms the features into alternate features using a kernel function. I know kernel function sounds like a heavy term, but all it means is that certain operations are performed on the features so that the new features become separable by straight decision boundaries. In a way, the transformation takes the features into alternate dimensions. The decision boundaries might not be straight when brought back to the original dimensions. This video shows an SVM in action.
Another beautiful example of decision boundaries can be seen on a zoo map, where different species are separated by different paths.
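As a small sketch of how the kernel choice matters, sklearn's SVC takes a kernel argument: "linear" keeps the boundaries straight, while the default "rbf" kernel allows curved ones. Comparing their accuracy on the training data itself (just for illustration, not a proper evaluation) looks like this:

```python
from sklearn import svm
from sklearn.datasets import load_iris

loaded = load_iris()
features, labels = loaded["data"], loaded["target"]

scores = {}
for kernel in ("linear", "rbf"):
    clf = svm.SVC(kernel=kernel)
    clf.fit(features, labels)
    # accuracy on the training data itself, just to compare kernels
    scores[kernel] = clf.score(features, labels)
    print(kernel, scores[kernel])
```

Both kernels do well on this easy dataset; on messier data the choice of kernel makes a much bigger difference.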
CODE SNIPPET FOR SVM CLASSIFICATION
from sklearn import svm
clf = svm.SVC()
clf.fit(features, labels)
What this code does is initialise an object named clf, an instance of the SVC class from the svm module. The fit method of this class fits a model to the features and the labels. The model fitting happens behind the scenes, and what you end up with is a clf object that can be used to classify an iris plant, given the same kind of features that were provided while fitting the model.
Now let's try to predict using the clf classifier object we created. For that we need the features of the iris plant to be classified, so we generate our own feature data by looking at the feature data we already have.
Class 0 is the setosa class. Let's create our own feature set as a variable named unknown. unknown holds the feature sets of two different plants, which I made up by tweaking values from the feature set we have on hand.
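Here is one way to build that unknown variable. The numbers below are hand-tweaked copies of real setosa rows, so treat them as illustrative values rather than measurements:

```python
# two made-up samples, each: sepal length, sepal width, petal length, petal width
unknown = [[5.1, 3.2, 1.5, 0.15],
           [4.6, 3.0, 1.6, 0.3]]
```

The order of the four numbers must match the order of the columns the model was fitted on.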
Now it's time to predict. We use the predict method of the classifier object.
The output concurs with what we expected: an array of size two, with each element denoting the setosa class, which is enumerated as 0. It is that easy to derive predictions from the classifier we built.
You are now equipped with all the knowledge necessary to get started with basic prediction. Do check out the different types of classifiers and how they work.
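If you'd rather see plant names than class numbers, you can map the prediction through target_names. A small self-contained sketch of the whole predict step:

```python
from sklearn import svm
from sklearn.datasets import load_iris

loaded = load_iris()
clf = svm.SVC()
clf.fit(loaded["data"], loaded["target"])

unknown = [[5.1, 3.2, 1.5, 0.15], [4.6, 3.0, 1.6, 0.3]]
pred = clf.predict(unknown)        # array of class numbers, e.g. [0 0]
print(loaded["target_names"][pred])  # the same predictions, as names
```

Indexing target_names with the prediction array turns the 0s into the string "setosa", which is easier to read.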
And here is the whole code in one single snippet:
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from sklearn import svm

loaded = load_iris()
features = loaded["data"]
feature_names = loaded["feature_names"]
labels = loaded["target"]

# now we create a pandas dataframe for easy manipulation later
# while plotting various exploratory graphs
df = pd.DataFrame(features, columns=feature_names)

# now we join the two data frames by adding another column
# and filling it with the values of labels
df["label"] = labels

# exploratory plot: sepal length vs petal length, coloured by label
for i, color in zip(range(3), "rbg"):
    plt.scatter(features[labels == i, 0], features[labels == i, 2], c=color, marker="o")
plt.show()

# modeling the support vector machine
clf = svm.SVC()
clf.fit(features, labels)

# now its time for prediction
unknown = [[5.1, 3.2, 1.5, 0.15], [4.6, 3.0, 1.6, 0.3]]
print(clf.predict(unknown))
I hope you liked this post and that you aren't too angry about how long it was. So here's a perfunctory GIF to demonstrate how exhausting writing a blog is.
DONT FORGET TO FOLLOW! 🙂