
Your Very Own Personalised Image Search Engine Using Python

For the last couple of weeks I was on a voyage that took me around India to places I have always wanted to visit. My parents decided that we should celebrate my birthday in the foothills of the Himalayas, which was brilliant, as it was something I had longed for: a break from regular city life and a move to the hilly side of the country, with treacherous roads winding into blind corners and tea gardens, in the north-eastern Indian state of Sikkim. The journey was brilliant and the stay was even better.

As happens on all such journeys, we clicked numerous photographs, posing in different stances so we could settle for the best ones. With so many photographs, and my mum's request to show all the clicks taken at a particular site, I was bogged down with a tedious task that required a lot of human effort and was mundane to the extent that my head started aching, as I had no clue which folder a particular picture came from. So I thought, why not automate the whole process?

So I wrote a script and later added a GUI for my mum, who is not tech savvy, turning it into a very easy four-click process. My software was able to retrieve the images I was looking for from around 400 GB of images on my PC's hard disk. Now I am sharing it with you guys.

One of the best results I got from this software was when I gave it an image straight off the internet that resembled the picture I was looking for, and it retrieved that particular image from a heap of 5300 images on my drive in about 20 seconds. Results were even better when I used an image I had lying on my desktop and wanted to find the folder it belonged to on one of the drives. 😀

The visually similar image I took from Google Images was this one from Sonya and Travis' blog. Huge shout-out to them 🙂. Go check out their blog as well. So here is the picture:


And I was able to retrieve this image from the hard disk, along with its location, so I could find the rest of the related images.


After adding a GUI to the script I ended up with this search engine/image retrieval system, which looked something like this :D.

Search Engine


By the end of this post you will be able to make your very own image search engine, like TinEye.

Here’s a quick little demo of what I was able to accomplish.

So let's get the skeleton of the image search engine ready. Let's get up close and personal with the cool stuff.


Our search engine will have the following features, in order of execution.

  • Selection of the picture to be searched.
  • Selection of the directory where the search needs to be carried out.
  • Searching the directory for all the pictures.
  • Creating a feature index of the pictures.
  • Evaluating the same features for the search picture.
  • Matching the pictures in our search.
  • Outputting the matched pictures.


First of all we will require OpenCV for Python to continue with our tutorial, so please download and install it. A quick Google search will help you find what you need. The other libraries that need to be imported are as follows.

import os
from os.path import join
import cv2
import numpy as np
import time as time
import scipy.spatial.distance as dist

Here we just need to know the address of the search image, and then we need to specify the directory where the search is to be carried out. It looks something like this.


This is the format in which the image and the directory are passed into the system.

Now that we know the directory we want to search, we need an index of images to compare the search image against. For this we crawl that directory looking for images in JPG format and compute the features we will need for comparison later.

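def find(directory):
    index = {}  # maps full image path -> feature vector
    for (dirname, dirs, files) in os.walk(directory):
        for filename in files:
            if filename.endswith(".JPG"):
                fullpath = join(dirname, filename)
                index[fullpath] = features(fullpath)

    print("total number of photos in this directory %s" % len(index))
    return index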

Here we use os.walk from the os library to scan a particular directory for all the JPG files, then use our features function to generate the features for each photo and add them to the dictionary named index, with the image address (the full path of the image) as the key, so that we know where that particular image was found when retrieving it later.
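If you want to see the crawling step on its own, here is a minimal sketch that only collects file paths, without computing any features. The name find_jpgs and the throwaway folder names are made up for this illustration, and unlike the function above it matches the extension case-insensitively, which is an assumption about what you would want on a real drive.

```python
import os
import tempfile

def find_jpgs(directory):
    """Return the full path of every .jpg/.jpeg file below `directory`."""
    found = []
    for dirname, dirs, files in os.walk(directory):
        for filename in files:
            # lower() lets the check match .jpg, .JPG, .Jpg and so on
            if filename.lower().endswith((".jpg", ".jpeg")):
                found.append(os.path.join(dirname, filename))
    return sorted(found)

# Build a tiny throwaway folder tree to try the crawler on.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "sikkim", "day1"))
for name in ("sikkim/a.JPG", "sikkim/day1/b.jpg", "sikkim/notes.txt"):
    open(os.path.join(demo_root, *name.split("/")), "w").close()

demo_paths = find_jpgs(demo_root)
print(len(demo_paths))  # the two images are found, the text file is skipped
```

Because os.walk descends into every subfolder, the nested day1 folder is picked up without any extra code.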


Now we move on to extracting the features from our images. What are features, by the way? A feature is something that distinctly describes an image, much like how the skin tone of most Indian people is brown compared to someone from Europe. These are features that might help you distinguish between them. Of course there are other features, like facial descriptions or voice, that could be used, as there is no limit to the kind and number of features you can use.

The feature we use here is called the colour histogram of the image. It is basically a frequency plot: for each range of red, green and blue intensity it counts how many pixels fall into that range. This is one brilliant video about the colour histogram of images; check it out to understand it in depth.
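To make the idea concrete, here is a rough pure-Python sketch of a colour histogram with 8 bins per channel, the same bin count the post uses. The function name color_histogram and the four toy pixels are invented for this example; a real image would of course go through OpenCV instead.

```python
def color_histogram(pixels, bins=8):
    """Count pixels per (blue, green, red) bin; each channel is split into `bins` ranges."""
    width = 256 // bins  # 32 intensity values per bin when bins=8
    hist = [0] * (bins ** 3)
    for b, g, r in pixels:
        # Turn the three per-channel bin numbers into one position in a flat list.
        position = (b // width) * bins * bins + (g // width) * bins + (r // width)
        hist[position] += 1
    return hist

# Two nearly black pixels and two nearly white ones (toy data).
toy_pixels = [(0, 0, 0), (10, 20, 30), (255, 255, 255), (250, 240, 230)]
toy_hist = color_histogram(toy_pixels)

print(len(toy_hist))               # 512 bins in total (8 * 8 * 8)
print(toy_hist[0], toy_hist[511])  # 2 dark pixels in the first bin, 2 bright ones in the last
```

Pixels with similar colours land in the same bin, which is why two photos of the same scene end up with similar histograms.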

Now let's get started with defining the function. Our function takes the image location as input and returns the histogram values.

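def features(imageDirectory):
    img = cv2.imread(imageDirectory)  # read the image file from disk
    histogram = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    Nhistogram = cv2.normalize(histogram, histogram)  # normalised histogram
    return Nhistogram.flatten()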

What we are doing here is using the cv2 (OpenCV for Python) library to read the file, then using cv2 to generate a matrix containing the histogram values of the image.
Nhistogram is the normalised histogram. Normalising the histogram helps make the feature scale invariant: even if you increase or decrease the size of the image, the histogram produced will always be similar, if not exactly the same. The histogram is also robust to image rotation, because it only counts colours and ignores where the pixels sit. So even if the image is rotated by 90 degrees or any other angle, the histogram will remain similar to the original one.

Finally we flatten the histogram matrix, bringing its dimension down to 1, which is essentially a list of numerical values, one per colour bin.
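The scale and rotation claims are easy to check on toy data. In this sketch the pixels are just colour names, and the normalisation divides every count by the number of pixels; that is an assumption made for simplicity and not necessarily the norm OpenCV applies. The helper name normalised_counts is made up.

```python
from collections import Counter

def normalised_counts(pixels):
    """Fraction of pixels that have each colour."""
    counts = Counter(pixels)
    total = float(len(pixels))
    return {color: n / total for color, n in counts.items()}

small = ["red", "red", "green", "blue"]
big = small * 4                  # same picture enlarged: 4x as many pixels
rotated = list(reversed(small))  # same pixels, different positions

print(normalised_counts(small) == normalised_counts(big))      # True
print(normalised_counts(small) == normalised_counts(rotated))  # True
```

The raw counts of the enlarged picture are four times bigger, but after normalising both pictures give the same fractions, and shuffling the pixel positions changes nothing at all.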

Now that our features function is ready, we can compute the histogram values of any image.

All we need now is a function that helps us determine the ranks of the images after comparison and finally gives us the top 10 images that look similar to our search image.

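def search(SearchImage, SearchDir):
    allimages = find(SearchDir)                    # index every image in the directory
    match = top(features(SearchImage), allimages)  # rank the index against the search image
    return match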

Now that we have defined the function for searching and producing the search results, all we are left with is to write the function that picks the top 10 matches in the image search, the images that are visually closest to our search image.

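def top(histim, allimages):
    correlation = {}
    for (address, value) in allimages.items():
        # chi-squared distance (the original line is missing; this common form is an assumption)
        correlation[address] = np.sum((histim - value) ** 2 / (histim + value + 1e-10))
    ranked = sorted(correlation.items(), key=lambda tup: float(tup[1]))
    return ranked[0:10]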

We are using the chi-squared distance to measure how well two images match. You could use other distance measures to get better results, or even take a weighted average of the results of different measures such as city block or Canberra distance. The lower the chi-squared distance, the higher the chance that the two images look similar, hence we use sorted.

sorted arranges the images in increasing order of chi-squared distance. The smaller the distance, the greater the similarity. Hence the top 10 results are the first 10 entries in the list returned by the top function.
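Here is a small self-contained sketch of the ranking step on made-up three-bin histograms. The function chi2_distance uses one common form of the chi-squared distance, which is an assumption since several variants exist, and the file names and numbers are invented.

```python
def chi2_distance(hist_a, hist_b, eps=1e-10):
    """Chi-squared distance between two histograms; 0 means identical."""
    # eps avoids dividing by zero when a bin is empty in both histograms
    return sum((a - b) ** 2 / (a + b + eps) for a, b in zip(hist_a, hist_b))

query = [0.5, 0.3, 0.2]
index = {
    "beach.jpg": [0.1, 0.1, 0.8],
    "hills.jpg": [0.5, 0.3, 0.2],
    "tea_garden.jpg": [0.4, 0.4, 0.2],
}

scores = {path: chi2_distance(query, hist) for path, hist in index.items()}
ranked = sorted(scores.items(), key=lambda item: item[1])  # smallest distance first

for path, distance in ranked:
    print(path, round(distance, 4))
```

hills.jpg has exactly the query histogram, so it comes out first with distance 0, followed by tea_garden.jpg and then beach.jpg.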

Now we have all the functions ready. All we need to do is call them in the right order to make our search work.

First we need to define our search directory and search image.


We just need to pass these two parameters to the search function.


The final output is a list of tuples of the form (address, value).

We iterate through the list and display the images as ranked.

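for imageAdd, Histvalue in finalOutput:
    resized = cv2.resize(cv2.imread(imageAdd), (640, 480))  # display size is an assumption
    cv2.imshow("image directory %s %s" % (imageAdd, Histvalue), resized)
    cv2.waitKey(0)  # wait for a key press before showing the next match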

There you have it: your very own personalised image search engine, which works on your own data.

So if you want to know how to develop the GUI for this script, which will open up a whole new world of programming and software development, SIGN UP 😀 using the follow option of this blog, in the widgets column on your right-hand side, to get an email notification when I post the next tutorial on developing this GUI.

If you have any queries or need further explanation, leave them in the comments below.

Caveat: the search speed really depends on what calculations you perform on the histogram. Built-in functions implemented in C tend to take a lot less time than self-defined Python functions. But it all boils down to the trade-off between accuracy and speed.


They all look the same. How do I classify ? :/

Today we will be dabbling in the world of classification analysis. To continue any further you need to know what classification analysis is and what it is used for. Those of you who are already familiar with what classification and regression mean can skip to the subsection where I address the data we will be using today.



Machine learning nowadays is put into two bags, called supervised learning and unsupervised learning.

We will be discussing supervised learning in our post today. A brief description of the two, and the difference between them, is as follows.

Say you are given a dataset containing the height, weight and hair length of volunteers, and you need to determine whether a particular volunteer is male or female; basically you want to determine the gender.

In supervised learning the height, weight and hair length values of each volunteer are labelled with their gender, and this data can be used to derive a formula to determine gender. This formula, or model, can later be used to predict whether a particular person is male or female.

Whereas in unsupervised learning we do not have labels to sort the volunteers into genders, but what we can generalise is that there are two groups whose members show similar traits and hence should belong to the same category. We can give them labels such as FIRST or SECOND.
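The supervised case can be sketched in a few lines. Below, the "formula" is simply the average measurements of each labelled group, and a new person gets the label of the nearest average. Every number here is invented for illustration, weight is left out to keep it short, and the names fit_centroids and predict are made up; real classifiers are more sophisticated than this.

```python
# (height in cm, hair length in cm) with a known label.
# All numbers are invented for illustration only.
labelled = [
    ((180, 5), "male"),
    ((175, 8), "male"),
    ((162, 40), "female"),
    ((158, 35), "female"),
]

def fit_centroids(samples):
    """Average the measurements of each label: this is the model we learn."""
    sums, counts = {}, {}
    for measurements, label in samples:
        totals = sums.setdefault(label, [0.0] * len(measurements))
        for i, value in enumerate(measurements):
            totals[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [t / counts[label] for t in totals] for label, totals in sums.items()}

def predict(model, measurements):
    """Return the label whose average is closest to `measurements`."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(measurements, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

model = fit_centroids(labelled)
print(predict(model, (182, 3)))   # male
print(predict(model, (160, 38)))  # female
```

Without the labels you could still notice that the four volunteers fall into two clumps, but you would have no way of knowing which clump to call which, and that is the unsupervised situation.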

Supervised machine learning is broadly divided into two tasks, the first being classification analysis and the second regression analysis. As you might have intuitively figured out, classification refers to distinguishing between various objects or entities by labelling them with a name or some unique ID, in order to group similar entities together for easy recognition later.

Regression is more about predicting continuous values. Here we are not trying to classify anything but rather trying to find future values, say stock prices or commodity prices. We won't talk in depth about this in this post, but in the next one we will.



There are various applications of machine learning for classification, such as distinguishing between a plant and an animal in a picture using image processing techniques. In the financial industry it is essential to assess whether a particular investment is bad or good. The banking industry uses classification to separate people who are likely to default on loans from people who will pay their debt on time. This is one aspect of machine learning that is practically employed in our world today.

Today we will be dealing with the classification of iris plants into 3 different categories.



The Iris data set is available on the UCI Machine Learning Repository website, where you can find different types of datasets to work with and learn about machine learning.

The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
The Iris data set and information about it can be found at this link: Iris Dataset.

The three categories of iris plant that we will try to classify are as follows.




Notice how there is only a very subtle difference between the categories. This is a limitation of human vision that can be overcome using machine learning.


We will be dealing with just basic visualisation and a very basic classifier, using Python to classify between Iris setosa, Iris versicolour and Iris virginica.

For this, the following libraries need to be installed: matplotlib, scikit-learn, numpy and pandas.

If you are not able to install the packages properly, check out this video to get you through.

You can also download Windows binaries from this website: Python Extensions. If you face any difficulty, please drop a comment below the blog.


First we import pyplot from matplotlib and then import the iris data set loader from the sklearn library (or scikit-learn, or whatever you might wanna call it). We also import the numpy and pandas modules.

from matplotlib import pyplot as plt
from sklearn.datasets import  load_iris
import numpy as np 
import pandas as pd

Now that all the modules have been imported, we load the data into a variable called loaded.

loaded = load_iris()
feature_names = loaded["feature_names"]

Now we create a pandas dataframe for easy manipulation later while plotting various exploratory graphs.


Now we join the two data frames by adding another column to features and filling it with the label values.


load_iris is a function in sklearn.datasets that returns a dictionary-like object.
In order to access the attributes of the data set we use the object loaded.
We assign the features of the data to a variable called features, and the feature names to the variable feature_names. labels are the categories to be predicted, i.e. versicolour, setosa and virginica.


Now for basic plotting we use the plt alias of the pyplot module.

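for i,color in zip(range(3),"rbg"):
    plt.scatter(loaded["data"][loaded["target"] == i, 0], loaded["data"][loaded["target"] == i, 1], c=color)  # columns 0 and 1: sepal length and sepal width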


What this code snippet does is plot the sepal length vs sepal width values for the different labels, with the colour also governed by the label. This helps us see which attributes can be used to separate the categories of iris plant. One of the plots obtained is shown here. You don't really need to understand the code, as it's just for exploratory purposes. If you want an in-depth explanation, leave a request in the comments section. I'll be glad to answer any queries.

The X-axis is the sepal length and the Y-axis denotes the petal length.

As can be seen from the plot of sepal length vs petal length, there is a marked difference between Iris setosa and the other two categories of iris plant, and this can be used to form a model based on the attributes sepal length and petal length. Just by looking at the scatter plot you can form a rule for yourself: if the sepal length is less than 6.0 and the petal length is less than 2.5, the plant can be classified as setosa, which is denoted by red dots in the plot. The red lines are called decision boundaries, as shown below.
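Written as code, that hand-made rule is just two comparisons. The function name is_setosa is made up for this sketch, and the three sample measurements are in the style of rows from the data set.

```python
def is_setosa(sepal_length, petal_length):
    """Hand-made rule read off the scatter plot: both values must be below the red lines."""
    return sepal_length < 6.0 and petal_length < 2.5

# (sepal length, petal length) in cm
samples = {
    "short petals": (5.1, 1.4),
    "long sepals and petals": (7.0, 4.7),
    "short sepals, long petals": (5.5, 4.0),
}

for name, (sepal_length, petal_length) in samples.items():
    print(name, is_setosa(sepal_length, petal_length))
```

Only the first sample passes both tests. The rule says nothing about how to tell the other two species apart, which is where an automatic classifier comes in.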

Now this is a very crude way of carrying out classification. But what if we have multiple properties or features that need to be used? What if we want to tell all three classes apart rather than just setosa and non-setosa?

In such a situation we can use the SVM classifier. SVM stands for support vector machine. What a support vector machine does is create decision boundaries for you automatically, and if two categories cannot be separated, like versicolor and virginica in our case, it transforms their features into alternate features using a kernel function. I know kernel function is a heavy term, but all it means is that it performs certain operations on the features so that the new features are linearly separable by straight decision boundaries. In a way the transformation takes the features into alternate dimensions. These decision boundaries might not be straight when brought back to the original dimensions. This video shows SVM in action.
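A toy version of that transformation idea: five points on a line, where the "inner" class sits between the "outer" ones, so no single cut-off separates them. Squaring each value gives a new feature where one cut-off is enough. This is only a hand-written feature map for illustration; a real SVM applies its kernel implicitly, and the names below are made up.

```python
def separable_by_one_threshold(samples):
    """True if sorting by value puts all of one class before all of the other."""
    labels = [label for value, label in sorted(samples)]
    switches = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return switches == 1

points = [(-3, "outer"), (-1, "inner"), (0, "inner"), (1, "inner"), (3, "outer")]

def feature_map(x):
    """The alternate feature: distance from zero, squared."""
    return x * x

mapped = [(feature_map(x), label) for x, label in points]

before = separable_by_one_threshold(points)
after = separable_by_one_threshold(mapped)
print(before, after)  # False True
```

In the squared feature the boundary is a single straight cut (anything between 1 and 9 works), but brought back to the original line it becomes two cuts, one on each side of zero, which is the "not straight in the original dimensions" effect described above.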

Another beautiful example of decision boundaries can be seen on a zoo map, where different species are separated by different paths.




from sklearn import svm
clf = svm.SVC().fit(loaded["data"], loaded["target"])  # default SVC settings are an assumption

What this code does is initialise an object named clf, which is a classifier from the svm module. The fit method is used to fit a model to the features and the labels. The model fitting is done behind the scenes, and what you end up with is a clf object that can classify an iris plant when given the same kind of features that were provided while fitting the model.

Now let's try to predict using the clf classifier object we generated. For that we need the features of the iris plant to be classified, and we generate our own feature data by looking at the feature data used.


Class 0 is the setosa class, so let's create our own feature set as a variable and name it unknown. unknown holds the feature sets of two different plants, which I made by manipulating the values of the feature set we have on hand.


Now it's time to predict. We use the predict method of the clf object.



array([0, 0])

The output concurs with what we expected: it is an array of size two, with each element denoting the setosa class, which is enumerated as 0. It is that easy to derive predictions from the classifier we built.

You are now equipped with the knowledge needed to get started with basic prediction. Do check out the different types of classifiers and how they work.

And here is the whole code in one single snippet.

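from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd
from sklearn import svm

loaded = load_iris()
feature_names = loaded["feature_names"]
labels = loaded["target"]

"""now we create a pandas dataframe
for easy manipulation later while plotting various exploratory graphs."""
features = pd.DataFrame(loaded["data"], columns=feature_names)

"""now we join the two data frames by adding another column in features
and adding values for labels in it."""
features["labels"] = labels

# scatter plot of sepal length vs sepal width, one colour per class
for i,color in zip(range(3),"rbg"):
    subset = features[features["labels"] == i]
    plt.scatter(subset[feature_names[0]], subset[feature_names[1]], c=color)
plt.show()

#modeling the support vector machine
# default SVC settings are an assumption; the original call is missing
clf = svm.SVC()
clf.fit(features[feature_names], labels)

#now its time for prediction
# two made-up plants close to setosa measurements (values are assumptions)
unknown = pd.DataFrame([[5.0, 3.4, 1.5, 0.2], [4.8, 3.1, 1.4, 0.3]], columns=feature_names)

print(clf.predict(unknown))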

I hope you liked this post and are not too angry about how long it was. So here's a perfunctory GIF to demonstrate how exhausting writing a blog is.


It's you again, DATA STRUCTURES!

Hey there! OK, let's start by saying this.

I did try to make a video about list comprehensions this time around, but I goofed up, and I really need to improve my video editing skills before I post any of it online, so it might be a while before I put up a vlog. I am going to continue where I left off: lists and working with lists.

So far I have covered how to initialise and work with a list. Now for some basic operations that a list comes with. These operations are called methods. Getting back to the example I gave you guys before, assume that you are going to a marketplace, or Lidl, or your local grocery store. So you make a list of the things you want to buy when you get there, something similar to what we have here.

So you initialise a list named shopping_list

shopping_list = ["milk", "butter", "Cheddar cheese", "plain yogurt"]

and now you want to carry out the following operations on your list, say

  • Add a product to the list
  • Add multiple products to the list
  • Remove a particular product from the list
  • or maybe, for god knows what reason … you want the index of a particular product. You know, some people just have these weird habits, or maybe you have OCD. I don't know, it's your problem. Deal with it!

Let's address these issues. Not the OCD one. You still gotta deal with that yourself.


ADDING a (single) PRODUCT to the list

OK, so what you do is use a simple method called append, with the following syntax

shopping_list.append("A pack of beer")

What this does is add the item "A pack of beer" to the end of the list. So when you print out your shopping_list using the print shopping_list command, what you get is this.

["milk", "butter", "Cheddar cheese", "plain yogurt", "A pack of beer"]

Someone is having a party tonight. What if you want to add multiple items?


ADDING multiple products to the shopping_list .

Now if you want to add multiple items, say Haribo, ice cream, salami and so on, you can use the method called extend.

So shopping_list.extend(["ice cream", "salami", "haribos"]) does the job for you. It adds the items to the end of the list. Remember that the parameter you pass to extend has to be an iterable; for now assume that means a list.
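Here is a quick sketch of what "iterable" means in practice, and of how extend differs from append when you hand both of them a list. The variable names nested and letters are just for this example.

```python
shopping_list = ["milk", "butter"]
shopping_list.extend(["ice cream", "salami"])  # a list works
shopping_list.extend(("haribos",))             # so does a tuple
print(shopping_list)  # ['milk', 'butter', 'ice cream', 'salami', 'haribos']

# append adds its argument as ONE item, even if that argument is a list.
nested = ["milk"]
nested.append(["ice cream", "salami"])
print(nested)  # ['milk', ['ice cream', 'salami']]

# A string is an iterable too, so extend adds it letter by letter.
letters = []
letters.extend("beer")
print(letters)  # ['b', 'e', 'e', 'r']
```

So if you want to add a single product by name, use append; passing the bare string to extend would chop it into characters.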

Now your list looks like this.

["milk", "butter", "Cheddar cheese", "plain yogurt", "A pack of beer", "ice cream", "salami", "haribos"]


REMOVING a Product from the list.

You know those days when you dream of having a lavish life and then you peek into your wallet


and you are like .

Yeah, you gotta remove those beer cans .. sorry man.

So how do you go about it? You use the method called remove, which takes as its argument the particular element you want to remove, so the exact value or item name is needed.

shopping_list.remove("A pack of beer")

This results in your shopping list looking like this

["milk", "butter", "Cheddar cheese", "plain yogurt", "ice cream", "salami", "haribos"]

it had to be done bro!!! sorry !

Now there is another method you can call to remove elements, either from the end of the list or, if you know it, by index.

shopping_list.pop() removes the last element of the list. So there go your haribos.

And if you know the index, say you know you wrote down butter as the second element of the list, you use

shopping_list.pop(1) (second element = index 1, remember), which removes your butter from the list … that cholesterol ain't gonna go down by itself.


FINDING INDEX of a particular product.

Now for the weird part.

If you want to know at which index a particular item appears, you can use

shopping_list.index("salami"), which outputs the index at which salami appears in the list.
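Putting the whole shopping trip together, here is a short runnable recap of the methods covered in this post. The variable names last_item, second_item and salami_position are made up for the example.

```python
shopping_list = ["milk", "butter", "Cheddar cheese", "plain yogurt"]

shopping_list.append("A pack of beer")                      # add one item at the end
shopping_list.extend(["ice cream", "salami", "haribos"])    # add several items
shopping_list.remove("A pack of beer")                      # remove by value

last_item = shopping_list.pop()     # removes and returns the last element
second_item = shopping_list.pop(1)  # removes and returns the element at index 1
salami_position = shopping_list.index("salami")

print(last_item)        # haribos
print(second_item)      # butter
print(shopping_list)    # ['milk', 'Cheddar cheese', 'plain yogurt', 'ice cream', 'salami']
print(salami_position)  # 4

# remove and index raise ValueError if the item is not in the list.
try:
    shopping_list.remove("A pack of beer")
except ValueError:
    print("no beer left to remove")
```

Note that pop hands the removed item back to you, while remove just deletes it, and both remove and index complain loudly if the product was never on the list.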

by now you must be like SHUT UP ALREADY !!


So I’ll drop the chalk here too and suggest you go through this particular Codecademy exercise to build your basics and get a complete grasp of the subject. I sincerely request you guys to go through this PARTICULAR TUTORIAL (this is hyperlinked), where they teach you how to traverse a list and access each element. I know I haven’t really covered much about machine learning, but this is important, as it is the base on which you build, so check this link out. Next post I’ll start with pandas and something called CSV files, so keep up.





Knock Knock ? Who’s there ? Data Structures.. :|


As promised, today I'm dropping my post about the data structures available in Python.

For those of you who are not familiar with the concept of data structures: basically, a data structure defines the way our data is stored so that it can be accessed efficiently.

AND good machine learning happens only when we have good-quality data to create our features from. If you don't know what a feature is, stick with this blog; I will be posting about it later.

As for the data scientists out here, remember that data scientists are often said to spend 70-80% of their time cleaning and sorting data.

Now the reality is that most of the data generated nowadays is unstructured. This is similar to working at a store that looks something like this.

Lovely little shrine for bad quality  unstructured data

Would you be ready to assist someone in finding a product of their choice in such a mess? HELLL NOOO!!!


So what you do is organise your data in a very structured and well-thought-out manner, for easy access in the future, and end up with something like this.

Praise the lord!!!

Someone’s job just got a lot easier.
As a machine learner, or someone who wants to be a DATA SCIENTIST, YOU GOT TO STRUCTURE YOUR DATA. Now what kind of structure you use depends completely on you and how you want to access the data. If speed is the need or space is the problem, you might wanna try a different structure, much like a store might try a circular setup like the one shown here.
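To make the store analogy concrete, here is a tiny sketch comparing two ways of storing the same products. With a plain list you may have to look at every item before you find the one you want, while a dictionary takes you straight to it by name. The aisle numbers and the helper find_in_list are invented for the example.

```python
shelf = ["salami", "milk", "haribos", "butter"]

def find_in_list(items, wanted):
    """Walk the shelf item by item and report how many items we had to look at."""
    for looked_at, item in enumerate(items, start=1):
        if item == wanted:
            return looked_at
    return None  # not on the shelf at all

# Same products, but stored with a label: product name -> aisle number (hypothetical).
aisle_of = {"salami": 4, "milk": 1, "haribos": 7, "butter": 1}

steps = find_in_list(shelf, "butter")
aisle = aisle_of["butter"]
print(steps)  # 4: butter was the last item we checked
print(aisle)  # 1: found directly by its name
```

With four products the difference does not matter, but with a few hundred thousand records the choice of structure decides whether a lookup feels instant or painfully slow.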


What are our options? Well, when you talk about storing data in Python, the first thing that comes to mind is the list. I thought of creating a vlog about lists but I’m kinda short on time, so I will put up a vlog about list operations, or comprehensions as we call them, later. Till then check out this introductory video by KHAN ACADEMY, which explains the nuances of the concept of a list pretty concisely.

Now that you are familiar with the concept of a list, I will keep this post short. In the next post I will cover list comprehensions; that’ll be tomorrow or today, depending on when I get the time. After learning about list comprehensions you should be able to script most algorithmic or data munging (cleaning) requirements. So please keep up 🙂. This will be followed by a post about dictionaries and sets. I have never really used sets for machine learning, but there is no harm in knowing about them.

And sign UP if you don't want to miss out on the next post.







To start off

First off …. Thank you very much for paying a visit to this page.

Alright, so we will be starting off the Python for machine learning blog today. Here’s what’s in store for you visitors out here.

Through this blog I’ll initially be posting about the general Python language operations that you need to know beforehand, so that the coding can be done easily without wondering about the syntax of the language.

As we all know, the effectiveness of a particular machine learning method or algorithm is fundamentally determined by the kind of features and data being fed to it, so it is essential to know how to handle this data. Hence general-purpose knowledge of the language, in our case Python, is very much essential.

As we gradually get accustomed to the language, we will start off with concepts in machine learning. I assume that you have at least some knowledge of machine learning and its uses. Later on we may introduce some concepts like MapReduce and other added bonus features.

This blog is specially for those who have little or minimal knowledge of computer science …