machine learning | Python For Machine Learning

For the last couple of weeks I was on a voyage that took me around India to places I had always wanted to visit. My parents decided that we should celebrate my birthday in the foothills of the Himalayas, which was brilliant, as it was something I had longed for: a break from regular city life and a move into the hilly side of the country, with treacherous roads winding into blind corners and tea gardens, in the north-eastern Indian state of Sikkim. The journey was brilliant and the stay was even better.

As happens on all such journeys, we clicked numerous photographs, posing in different stances to settle on the best ones. With so many photographs, and my mum's request to show all the clicks that were taken at a particular site, I was bogged down with a tedious task that required a lot of manual effort and was mundane to the extent that my head started aching, as I had no clue which folder a particular picture came from. I thought, why not automate the whole process?

So I wrote a script and later added a GUI for my mum, who is not tech savvy, making it a very easy four-click process. My software was able to retrieve the images I was looking for from around 400 GB of images on my PC's hard disk. Now I am sharing it with you guys.

One of the best results I got from this software was when I provided it with an image straight off the internet that resembled the picture I was looking for, and it retrieved that particular image from a heap of 5300 images on my drive in about 20 seconds. Results were even better when I used an image I had lying on my desktop and wanted to find the folder it belonged to on one of the drives. 😀

The visually similar image I took from Google Images was this one from the Sonya and Travis blog. Huge shout out to them 🙂. Go check out their blog as well. So here is the picture:

And I was able to retrieve this image from the hard disk, along with its location, so that I could find the rest of the related images.

After adding a GUI to the script I ended up with this search engine / image retrieval system, which looked something like this :D

By the end of this post you will be able to make your very own image search engine like TinEye.

Here’s a quick little demo of what I was able to accomplish.

So let's get the skeleton of the image search engine ready. Let's get up close and personal with the cool stuff.

BREAKING DOWN THE PROBLEM STATEMENT.

Our search engine will have the following features, in the order of execution given below.

Selection of the Picture to be searched.
Selection of the directory where the search needs to be carried out.
Searching the directory for all the Pictures.
Creating feature index of the Pictures.
Evaluating the same feature for the search Picture.
Matching the pictures in our search.
Outputting the matched Pictures.

SELECTION OF THE SEARCH PICTURE AND SEARCH DIRECTORY

First of all, we need OpenCV for Python to continue with this tutorial, so please download and install it. A quick Google search will help you find what you need. The other libraries that need to be imported are as follows.

import os
from os.path import join
import cv2
import numpy as np
import time as time
import scipy.spatial.distance as dist

Here we just need to know the path of the search image, and then we need to specify the directory where the search is to be carried out. It looks something like this.

# Raw strings (r"...") stop Python from treating backslashes in Windows paths as escapes.
directory = r"D:\photo"
searchImage = r"C:\image11.jpg"

This is the format in which the image and the directory are passed into the system.
With that, we have both the search image and the directory we want to search.

SEARCHING THE DIRECTORY FOR ALL THE PICTURES AND MAKING AN INDEX

Now that we know the directory we want, we need an index of images to compare the search image against. For this we crawl the directory looking for images in JPG format and compute the features we will need for comparison later.

index = {}

def find(directory):
    # Walk the directory tree and index every JPG file found.
    for (dirname, dirs, files) in os.walk(directory):
        for filename in files:
            # Compare in lower case so both ".jpg" and ".JPG" match.
            if filename.lower().endswith(".jpg"):
                fullpath = join(dirname, filename)
                index[fullpath] = features(fullpath)

    print("total number of photos in this directory %s" % len(index))
    return index

Here we use the os.walk function from the os module to scan a particular directory for all the JPG files. We then use our features function to generate the feature for each photo and add it to the dictionary named index, with the image address (the full path of the image) as the key, so that we know where that particular image was found when retrieving it later.
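The crawl described above can also be tried on its own, separately from the feature extraction. What follows is only a minimal sketch: the helper name find_images and its extensions parameter are my own assumptions for illustration, not part of the original script.

```python
import os


def find_images(directory, extensions=(".jpg", ".jpeg")):
    """Yield the full path of every file under directory with a matching extension.

    find_images and its extensions parameter are hypothetical names for this sketch.
    """
    for dirname, _dirs, files in os.walk(directory):
        for filename in sorted(files):
            # Compare in lower case so "IMG_1.JPG" and "img_1.jpg" both match.
            if filename.lower().endswith(extensions):
                yield os.path.join(dirname, filename)
```

Because it is a generator, you can loop over find_images(directory) and compute the features one image at a time, without first building a list of every path.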

FEATURE FUNCTION FOR OUR IMAGES

Now we proceed to computing the features of our images. What are features, by the way? A feature is something that distinctly characterises an image, much like skin tone: most Indian people have a browner skin tone than someone from Europe. Such features might help you distinguish between them. Of course there are other features, like facial description or voice, that could be used, as there is no limit to the kind and number of features you use.

The feature we use here is called the color histogram of the image. It is basically a frequency plot of the intensities of the red, green and blue channels over all the pixels. This is one brilliant video about the color histogram of images; check it out to understand it in depth.
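To make the idea concrete, here is a small sketch that builds a color histogram for a tiny hand-made image using only NumPy. The helper color_histogram is a hypothetical name used for illustration, and the choice of 8 bins per channel is an assumption that mirrors the bin count used later in this post.

```python
import numpy as np


def color_histogram(pixels, bins=8):
    """Count how many pixels fall into each (channel0, channel1, channel2) bin.

    pixels is an H x W x 3 array of 8-bit values; color_histogram is a
    hypothetical helper for this sketch.
    """
    flat = pixels.reshape(-1, 3).astype(np.float64)
    hist, _edges = np.histogramdd(flat, bins=(bins, bins, bins),
                                  range=((0, 256), (0, 256), (0, 256)))
    return hist


# A tiny 2 x 2 "image": three pure-red pixels and one pure-blue pixel (RGB order).
tiny = np.array([[[255, 0, 0], [255, 0, 0]],
                 [[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
hist = color_histogram(tiny)
print(hist.shape)     # (8, 8, 8)
print(hist[7, 0, 0])  # 3.0, the three red pixels
print(hist[0, 0, 7])  # 1.0, the single blue pixel
```

Each of the 8 x 8 x 8 cells counts the pixels whose three channel values fall into that combination of intensity ranges, so two photos with a similar mix of colors end up with similar histograms.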

Now let's get started with defining the function. Our function takes the image location as input and returns the histogram values.

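def features(imageDirectory):
    # Read the image from disk (OpenCV loads it in BGR channel order).
    img = cv2.imread(imageDirectory)
    # 3D color histogram: 8 bins per channel, each channel covering 0 to 255.
    histogram = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    # Newer OpenCV versions need an explicit destination argument here.
    Nhistogram = cv2.normalize(histogram, histogram)
    return Nhistogram.flatten()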

What we are doing here is using the cv2 library (OpenCV for Python) to read the file, and then using cv2 again to generate a matrix containing the histogram values of the image.
Nhistogram is the normalised histogram. Normalising the histogram helps make the feature scale invariant. That means that even if you increase or decrease the size of the image, the histogram produced will always be similar, if not identical. The histogram is also robust to image rotation, because it only counts colors and ignores where the pixels sit. So even if the image is rotated by 90 degrees or some other angle, the histogram will remain similar to the original one.
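The scale and rotation behaviour can be checked with a small experiment. This is a sketch under stated assumptions, not the script's own code: it uses NumPy instead of OpenCV, scales the histogram to unit L2 norm, and enlarges the image by simply repeating pixels. Real resizing interpolates between pixels, so in practice the histograms come out similar rather than identical. The helper normalised_histogram is a hypothetical name.

```python
import numpy as np


def normalised_histogram(pixels, bins=8):
    """Flattened color histogram scaled to unit L2 norm (hypothetical helper)."""
    flat = pixels.reshape(-1, 3).astype(np.float64)
    hist, _edges = np.histogramdd(flat, bins=(bins, bins, bins),
                                  range=((0, 256), (0, 256), (0, 256)))
    hist = hist.flatten()
    return hist / np.linalg.norm(hist)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(20, 30, 3), dtype=np.uint8)

# Double the size by repeating every pixel, and rotate the original by 90 degrees.
bigger = image.repeat(2, axis=0).repeat(2, axis=1)
rotated = np.rot90(image)

print(np.allclose(normalised_histogram(image), normalised_histogram(bigger)))   # True
print(np.allclose(normalised_histogram(image), normalised_histogram(rotated)))  # True
```

Doubling the size multiplies every bin count by four, which the normalisation cancels out, while rotating only moves pixels around without changing their colors.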

Finally we flatten the histogram matrix, bringing its dimension down to 1. The result is essentially a list of the normalised bin counts, 8 x 8 x 8 = 512 values in all.

Now we have our features ready.

COMPARING THE FEATURES.

As we have our feature function ready, we can find the histogram of any image.

All we need now is a function that ranks the images after comparison and gives us the top 10 images that look most similar to our search image.

def search(SearchImage, SearchDir):
    # Histogram of the query image, computed with the features function defined earlier.
    histim = features(SearchImage)
    allimages = find(SearchDir)
    match = top(histim, allimages)
    return match

Now that we have defined the function for searching and producing the search results, all we are left with is to write the function that finds the top 10 matches and returns the images that are visually closest to our image.

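def top(histim, allimages):
    correlation = {}
    for (address, value) in allimages.items():
        # Chi-squared distance between the search histogram and this image's histogram.
        # (OpenCV 2.4 names this constant cv2.cv.CV_COMP_CHISQR.)
        correlation[address] = cv2.compareHist(histim, value, cv2.HISTCMP_CHISQR)
    # Sort by distance, smallest (most similar) first.
    ranked = sorted(correlation.items(), key=lambda tup: float(tup[1]))
    return ranked[0:10]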

We are using the chi-squared distance to measure how similar the two images are. You could try other distance measures to get better results, or even take a weighted average of the results of several of them, such as the city block or Canberra distance. The lower the chi-squared distance, the more likely the two images are to look similar. Hence we use sorted.
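To see what such a distance does, here is a minimal NumPy sketch. It implements one common symmetric form of the chi-squared distance, which is an assumption on my part and not necessarily the exact formula OpenCV applies, alongside the city block distance. The function names and the three toy histograms are made up for illustration.

```python
import numpy as np


def chi_squared(hist_a, hist_b, eps=1e-10):
    """Symmetric chi-squared distance between two histograms (0 means identical)."""
    hist_a = np.asarray(hist_a, dtype=np.float64)
    hist_b = np.asarray(hist_b, dtype=np.float64)
    # eps avoids division by zero for bins that are empty in both histograms.
    return 0.5 * np.sum((hist_a - hist_b) ** 2 / (hist_a + hist_b + eps))


def city_block(hist_a, hist_b):
    """City block (L1) distance: the sum of absolute bin differences."""
    hist_a = np.asarray(hist_a, dtype=np.float64)
    hist_b = np.asarray(hist_b, dtype=np.float64)
    return np.sum(np.abs(hist_a - hist_b))


# Toy histograms, invented for this example.
query = [0.5, 0.5, 0.0]
close = [0.4, 0.6, 0.0]
far = [0.0, 0.1, 0.9]
print(chi_squared(query, query))                            # 0.0
print(chi_squared(query, close) < chi_squared(query, far))  # True
print(city_block(query, close) < city_block(query, far))    # True
```

Both measures agree here that close is the better match. On real photos they can disagree, which is why combining several of them may be worth trying.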

sorted is used to sort the images in increasing order of chi-squared distance. The smaller the distance, the greater the similarity. Hence the top 10 results are the first 10 entries in the list returned by the top function.
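The ranking step can be tried in isolation with a few made-up distances. The paths and values below are hypothetical, and heapq.nsmallest is shown only as an optional alternative from the standard library, not as something the original script uses.

```python
import heapq

# Hypothetical distances for illustration: image path -> chi-squared distance.
distances = {
    "trip/a.jpg": 4.2,
    "trip/b.jpg": 0.3,
    "home/c.jpg": 1.7,
    "home/d.jpg": 0.0,
}

# sorted() orders every (path, distance) pair, smallest distance first.
ranked = sorted(distances.items(), key=lambda pair: pair[1])
top_two = ranked[:2]

# heapq.nsmallest returns the same leading entries without sorting the whole list.
top_two_heap = heapq.nsmallest(2, distances.items(), key=lambda pair: pair[1])

print(top_two)       # [('home/d.jpg', 0.0), ('trip/b.jpg', 0.3)]
print(top_two_heap)  # [('home/d.jpg', 0.0), ('trip/b.jpg', 0.3)]
```

For a few thousand images the difference is negligible, so plain sorted followed by a slice is perfectly adequate.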

Now we have all the functions ready. All we need to do is call them in the right order to make our search work.

First we need to define our search directory and search image.

directory = r"D:\photo"
searchImage = r"C:\image11.jpg"

We just need to pass these two parameters to the search function.

finalOutput=search(searchImage,directory)

finalOutput is a list of tuples of the form (address, value).

We iterate through the list and display the images as ranked.

for imageAdd,Histvalue in finalOutput:
    image=cv2.imread(imageAdd)
    resized=cv2.resize(image,(0,0),fx=0.25,fy=0.25)
    cv2.imshow("image directory %s %s"% (imageAdd,Histvalue),resized)
    cv2.waitKey(0)

There you have it: your very own personalised image search engine, which works on your own data.

So if you want to know how to develop the GUI for this script, which will open up a whole new world of programming and software development, SIGN UP 😀 using the follow option of this blog, in the widgets column on your right-hand side, to get a notification in your email when I post the next tutorial on developing this GUI.

If you have any query or need further explanation, leave it in the comments below.

A caveat: the search speed really depends on what calculations you perform on the histogram. If you use built-in functions that are implemented in C, they tend to take a lot less time than self-defined functions. But it all boils down to the trade-off between accuracy and speed.
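If you want to measure this on your own machine, a rough comparison could look like the sketch below. It times the same chi-squared distance written as a plain Python loop and as NumPy array operations, which run in compiled code. The function names are hypothetical, the histograms are random stand-ins, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np


def chi_squared_loop(a, b, eps=1e-10):
    """Chi-squared distance written as a plain Python loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2 / (x + y + eps)
    return 0.5 * total


def chi_squared_numpy(a, b, eps=1e-10):
    """The same distance, computed by NumPy's compiled array operations."""
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))


rng = np.random.default_rng(0)
a = rng.random(512)  # 512 values, the length of a flattened 8 x 8 x 8 histogram
b = rng.random(512)

loop_time = timeit.timeit(lambda: chi_squared_loop(a, b), number=200)
numpy_time = timeit.timeit(lambda: chi_squared_numpy(a, b), number=200)
print("loop:  %.4f s" % loop_time)
print("numpy: %.4f s" % numpy_time)
```

Both functions return the same value up to floating-point rounding, so any difference you see is purely in running time.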


Learning through Experience


Your Very Own Personalised Image Search Engine using python.