Jyotiska NK

Personal Blog and Website

Simple Face Detection Using Python and OpenCV


OpenCV has really good and stable bindings for Python. This makes it easier to develop image processing programs without leaving the comfort of Python or turning to alternative solutions like SimpleCV. For the past week, I have been tinkering with OpenCV in quite some detail and it has been super fun. In this post, I will show how to run a simple face detection script using Python.

The prerequisites are OpenCV (I use 2.4.9 with the Python binding) and Python 2.7+.

The latest OpenCV can be installed easily using:

brew install opencv

For complete Mac installation, follow this link. For Ubuntu installation, follow this link. Homebrew will install all the dependencies on a Mac.
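
Once it is installed, a quick check from the Python prompt confirms that the binding is available:

import cv2
print cv2.__version__  # should print 2.4.9, or whatever version got installed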

We will run the face detection using Haar feature-based cascade classifiers. The OpenCV installation already comes with pre-trained classifiers for many facial keypoints such as the eyes, nose, mouth and the face itself. We will use the file haarcascade_frontalface_alt.xml, which is available in the OpenCV installation directory, at /usr/local/Cellar/opencv/2.4.9/share/OpenCV/haarcascades on my machine. The path might vary based on the installation, but it should be inside the directory where OpenCV was installed. The XML file needs to be copied into the local directory where the Python face detection script will be.

Following is the code:

import cv2

# Load the input image and the pre-trained frontal face cascade.
img = cv2.imread("input_pic.jpg")
cascade = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")

# Detect faces: scale factor 1.3, at least 4 neighbors, minimum face size 20x20 pixels.
rects = cascade.detectMultiScale(img, 1.3, 4, cv2.cv.CV_HAAR_SCALE_IMAGE, (20, 20))

if len(rects) != 0:
    # Convert (x, y, w, h) rectangles into (x1, y1, x2, y2) corner coordinates.
    rects[:, 2:] += rects[:, :2]
    for x1, y1, x2, y2 in rects:
        cv2.rectangle(img, (x1, y1), (x2, y2), (127, 255, 0), 2)
    cv2.imshow('Face Detection', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print "Could not detect face from the image."

Let us find out how this works for us. I have used the image on the left, a cast photo from The Avengers movie, as input. On the right is the image with the detected face.
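
If you would rather save the annotated image than open a window (say, on a headless machine), cv2.imwrite does the job; a one-line addition to the script above, with a made-up output filename:

# write the image with the detection rectangles drawn on it
cv2.imwrite("output_pic.jpg", img)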

Well, that was easy!

Atom by Github


Github is bringing out a new text editor named ‘Atom’. Though it looks somewhat like Sublime Text, it is expected to be good. Also, it has a Python package. The project is still invite-only and in beta. Check here for more.

Minimal 2048


I decided to put a little minimalist touch on the viral game 2048. I forked the repo from Github and removed the text from the tiles, which means the tiles no longer show you the numbers. What you ultimately have is a bunch of tiles of different colors. I must say this looks better, and since the colors are pretty distinct from one another, it feels good to play. Check it out here.

Buddhism and Modern Psychology


Coursera is offering a free online course on Buddhism and Modern Psychology by Robert Wright of Princeton University. I found this course through r/meditation on Reddit. This is going to be a perfect course for me given its topics and syllabus. This is also what sets Coursera apart from the rest of the MOOC platforms. I previously took “Introduction to Philosophy” offered by the University of Edinburgh (hope I spelled that correctly!). That course turned out to be surprisingly good. Hoping this one will be as good, if not better. The course can be accessed from here.

Thank you Coursera!

Vincent


Vincent is a Python-to-Vega translator which helps build easy visualizations on top of d3.js with Vega. Now what is Vega? Vega is a visualization grammar for creating and declaring visualization designs.

Vincent converts native Python data structures and collections into the Vega visualization grammar, and it also supports pandas and IPython, including the IPython Notebook! Vincent is available in the Python package repository and can be easily installed using pip install vincent. Using Vincent is dead easy and it supports many of the visualization templates available for d3.js. The next plan is to use Vincent for our visualization work at DataWeave :)

The source code of Vincent is free and available on Github here.

For example, the following creates a simple bar chart from some sample data (multi_iter1 below is just a made-up dict of values):

import vincent

# sample data; Vincent also accepts lists, dicts and pandas Series/DataFrames
multi_iter1 = {'y1': [10, 20, 30, 20, 15, 30, 45]}
bar = vincent.Bar(multi_iter1['y1'])
bar.axis_titles(x='Index', y='Value')
bar.to_json('vega.json')
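
Since Vincent understands pandas objects directly, the same pattern works for a time series. A small sketch, assuming pandas is installed; the series values, dates and output filename here are made up:

import pandas as pd
import vincent

# a made-up daily series, just to show that pandas objects work directly
ts = pd.Series([3, 5, 2, 8, 6], index=pd.date_range('2014-01-01', periods=5))
line = vincent.Line(ts)
line.axis_titles(x='Date', y='Value')
line.to_json('vega_line.json')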

Pomodoro


Pomodoro is an Italian word which means ‘tomato’! It is also the name of an effective and popular time-management technique which uses a timer to break work into short intervals, usually 25 minutes each; the intervals are known as pomodori. There are five steps in total (taken from Wikipedia), and a toy timer sketch follows the list:

  1. Decide what you want to do
  2. Set the pomodoro timer to 25 mins
  3. Keep working until timer goes off
  4. Take a short break for 5 mins
  5. Every four pomodori, take a longer break of 15 mins
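
Just for fun, here is a toy command-line timer that follows the steps above (a minimal sketch in Python 2, matching the rest of the code on this blog):

import time

def pomodoro(work_min=25, short_break_min=5, long_break_min=15, cycles=4):
    # Work/break lengths follow the steps above; one call runs four pomodori.
    for i in range(1, cycles + 1):
        print "Pomodoro %d: work for %d minutes" % (i, work_min)
        time.sleep(work_min * 60)
        if i < cycles:
            print "Short break (%d minutes)" % short_break_min
            time.sleep(short_break_min * 60)
    print "Longer break (%d minutes)" % long_break_min
    time.sleep(long_break_min * 60)

pomodoro()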

Hope you find this useful!

Pickling in Python


Python provides a pickle module for serializing and de-serializing Python objects and collections. There are other names for pickling, such as flattening or marshalling. The process of converting a Python object into a byte stream is called pickling, and the reverse process of getting the object back is known as unpickling. Python also provides cPickle, which is written in C and can be up to 1000 times faster than the standard pickle module.

One short example of pickling is the following, which is given in the official Python documentation:

import pickle

data1 = {'a': [1, 2.0, 3, 4+6j],
         'b': ('string', u'Unicode string'),
         'c': None}

selfref_list = [1, 2, 3]
selfref_list.append(selfref_list)

output = open('data.pkl', 'wb')

# Pickle dictionary using protocol 0.
pickle.dump(data1, output)

# Pickle the list using the highest protocol available.
pickle.dump(selfref_list, output, -1)

output.close()
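
Unpickling is just as short: the objects are read back in the order they were written. A small sketch reading back the file produced above:

import pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
print data1

selfref_list = pickle.load(pkl_file)
print selfref_list

pkl_file.close()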

However, I prefer to encode my collections and objects into JSON using the simplejson module. It is neat and simple. One added advantage is that you can actually see your data in a readable format. JSON serialization and deserialization is quite fast, and I have never faced any problems except the occasional Unicode error. At DataWeave, a single JSON file can reach up to a gigabyte and can be quite memory-intensive if you are loading the entire thing into RAM. Still, JSON can be a very good replacement for the pickle module.
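
For comparison, here is roughly how the same kind of data looks with simplejson; a minimal sketch (note that JSON cannot represent everything pickle can, e.g. complex numbers or self-referencing lists):

import simplejson as json

data = {'a': [1, 2.0, 3], 'b': ('string', u'Unicode string'), 'c': None}

# Serialize to a human-readable text file...
with open('data.json', 'w') as f:
    json.dump(data, f)

# ...and load it back. JSON has no tuple type, so 'b' comes back as a list.
with open('data.json') as f:
    restored = json.load(f)

print restored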

In the Spark project, we use the cloudpickle module, which is an improvement over the standard pickle module and adds more features to it. There have been some proposals to use JSON for serialization, but nothing has been finalized yet.
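
As a quick illustration of what cloudpickle adds, it can serialize things the standard module refuses, such as lambdas and interactively defined functions. A minimal sketch, assuming the standalone cloudpickle package is available (inside Spark it ships as pyspark.cloudpickle):

import pickle
import cloudpickle  # standalone package; inside Spark it lives at pyspark.cloudpickle

square = lambda x: x * x

# The standard pickle module cannot serialize a lambda...
try:
    pickle.dumps(square)
except pickle.PicklingError as e:
    print "pickle failed:", e

# ...but cloudpickle serializes the function object itself,
# and the result can be loaded back with plain pickle.
blob = cloudpickle.dumps(square)
print pickle.loads(blob)(5)  # prints 25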

Run PySpark With Data Stored in HDFS


In a previous post, I showed how to run a PySpark job where the data is stored in the local file system. In order to access files stored in HDFS, we only need to specify the absolute path of the file in HDFS. The rest works the same.

from pyspark import SparkContext

# Point the context at the cluster master, then read straight from HDFS.
sc = SparkContext("spark://master:7077", "Test")
data = sc.textFile("hdfs://master/path/to/file")
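
Writing results back to HDFS works the same way; a small sketch using the data RDD from above, with a made-up output path:

# split into words and write the result back to HDFS as text files
words = data.flatMap(lambda line: line.split())
words.saveAsTextFile("hdfs://master/path/to/output")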

Lambda Functions in Python


Lambda functions are anonymous functions created at runtime. They are closely tied to the functional programming style supported by Python, and methods like filter(), map() and reduce() are the most common use cases for lambda functions. Using clever lambda functions also shortens your code, but excessive and unnecessary use makes code unreadable and complicates the project. Example:

foo = [2, 18, 9, 17, 24, 8, 12, 27]

# keep only the multiples of 3
print filter(lambda x: x % 3 == 0, foo)
# [18, 9, 24, 12, 27]

# double each element and add 10
print map(lambda x: x * 2 + 10, foo)
# [14, 46, 28, 44, 58, 26, 34, 64]

# sum all the elements
print reduce(lambda x, y: x + y, foo)
# 117
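
Another common place for a lambda is the key argument of sorted(), for example sorting pairs by their second element:

pairs = [('b', 3), ('a', 1), ('c', 2)]
print sorted(pairs, key=lambda p: p[1])
# [('a', 1), ('c', 2), ('b', 3)]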

Run WordCount Using PySpark


In the last post, I showed how to set up a Spark cluster on your Ubuntu machine. In this post, I will explain how to run your first PySpark program, which counts the frequency of each word in a document. WordCount is the classic example for showing how a MapReduce program works, and it will also help you understand the workflow in Spark better.

To start the shell, use one of the following. I use IPython; the second option shows how to start the Spark shell using IPython:

./bin/pyspark # for running the default pyspark shell
IPYTHON_OPTS="qtconsole --pylab inline" ./bin/pyspark # for running the shell in an IPython qtconsole

Once you have started the shell, run the following commands; I will explain each line subsequently.

from pyspark import SparkContext

# local master, app name "pyspark"
sc = SparkContext("local", "pyspark")
text = sc.textFile("path of the text file")

# split each line into words, pair every word with 1, then sum the 1s per word
counts = text.flatMap(lambda x: x.split()).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

for (word, count) in counts.collect():
    print word, count

That’s it! In a few lines, it is possible to compute the word count of gigabytes of data. The first line imports SparkContext from the pyspark package. Then we create a context sc, where the master is local and the app name is pyspark. We use the context sc to open the text file on which we are going to compute the word count. Next we create an RDD, counts, by performing a flatMap() operation, which produces the list of words from the entire document (we will cover RDDs in detail in upcoming posts). We then perform map() on the list of words, which gives every word a frequency of 1. After that, we use reduceByKey() to group the pairs by word and add up their frequencies. Finally, we call the collect() method on the RDD counts and retrieve all the words with their corresponding frequencies.
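
If you only care about the most frequent words, the collected pairs can be sorted in plain Python; a small sketch using the counts RDD from above:

# sort the (word, count) pairs by count, largest first, and keep the top 10
top = sorted(counts.collect(), key=lambda pair: pair[1], reverse=True)[:10]
for word, count in top:
    print word, count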

In the next post, I will discuss more examples with Python and details of the Spark framework. Keep checking!