Jyotiska NK

Personal Blog and Website

ColorWeave Ported in Go

| Comments

This week I ported ColorWeave to Go. The original Python code extracts the dominant colors from an image, and I was curious whether the same could be done in Go, which is more popular for building web applications and seldom used for image processing. It turns out Go has a pretty good image library with plenty of options, but writing about 100 lines of code still took me 3-4 hours, thanks to Go's strict type system.
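For reference, the core idea behind dominant-color extraction is just counting pixel frequencies. Here is a rough sketch of that idea in Python (the function name and `top` parameter are my illustration, not ColorWeave's actual code):

```python
from collections import Counter

def dominant_colors(pixels, top=4):
    """Return the `top` most frequent (r, g, b) tuples in a list of pixels."""
    return [color for color, _ in Counter(pixels).most_common(top)]

# With a Pillow image, the pixel list would come from image.getdata().
```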

I also used the webcolors library to find the closest named color for the RGB value of every pixel, following either the CSS 2.1 or the CSS 3 specification. This made the job a bit easier, as I had already written the library a couple of months back. The code is fairly flexible and takes a couple of parameters: how many of the top dominant colors to show, and which specification the output should follow. If you want only the base colors, CSS 2.1 is the ideal choice; use CSS 3 for more shades. This is not a library and is meant to be run as a standalone application, although it can be used as a library too and coupled with other programs.
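The closest-color lookup amounts to a nearest-neighbour search in RGB space. A minimal sketch against the 16 CSS 2.1 basic colors (the function name and the Euclidean distance metric are my illustration, not necessarily the library's exact implementation):

```python
# The 16 basic color keywords defined by the CSS 2.1 specification.
CSS21_COLORS = {
    'black': (0, 0, 0), 'silver': (192, 192, 192), 'gray': (128, 128, 128),
    'white': (255, 255, 255), 'maroon': (128, 0, 0), 'red': (255, 0, 0),
    'purple': (128, 0, 128), 'fuchsia': (255, 0, 255), 'green': (0, 128, 0),
    'lime': (0, 255, 0), 'olive': (128, 128, 0), 'yellow': (255, 255, 0),
    'navy': (0, 0, 128), 'blue': (0, 0, 255), 'teal': (0, 128, 128),
    'aqua': (0, 255, 255),
}

def closest_color(rgb):
    """Return the CSS 2.1 color name nearest to an (r, g, b) tuple."""
    return min(CSS21_COLORS,
               key=lambda name: sum((a - b) ** 2
                                    for a, b in zip(CSS21_COLORS[name], rgb)))
```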

Now that the experiment is over, I will go back to building APIs. There is a meetup coming up in Bangalore, which is going to be exciting. BTW, the code is available on Github and can be found here.

Porting API From Bottle to Flask

| Comments

Last week, we decided to move a major API from Bottle to Flask. If you have played with micro web frameworks before, you must have come across both of them. When we started the prototype, we decided to stick to Bottle because it was a tiny library and we did not have to worry about scaling or rendering templates. Bottle worked fine for us for 3-4 months. We had to suffer occasional crashes and sometimes had difficulty parsing form parameters, but Bottle survived.

Now, the API is close to 2000 lines of code and Bottle is starting to break. We also want to render HTML templates, handle POST form requests and use gevent with the API. Even though it is possible to make Bottle do all of this, we are not really sure whether Bottle can withstand it all in a production environment. So, we started to look for alternatives. There were a few. Vikranth uses CherryPy, but he also mentioned that he has seen a few segmentation faults while handling some MySQL queries. Tornado would be a good choice, but it is also a bigger cannon than we need. Finally, we settled on Flask. Flask has the reputation of being an all-rounder web framework. It can handle high-traffic APIs, render large forms and HTML documents without breaking a sweat, and can be coupled with gevent. Flask met all our requirements.

We had previous experience with Flask. In fact, I have used Flask (and still do) for some other projects, so the transition went smoothly. Bottle and Flask share much of their syntax: whether it is reading parameters passed with a GET request or routing API calls to different handlers, there were very few code rewrites. In fact, we mostly used bulk find-and-replace for these cases. That's it: we had ported our API codebase without bloating it up or writing extra lines of code. Flask also comes with a debugger and console, which are a blessing when testing new API endpoints.

The next thing we are trying is to couple Flask with our internal Celery cluster, to schedule our crawls and jobs through an interface. We will share further updates as we have them.

Simple Face Detection Using Python and OpenCV

| Comments

OpenCV has really good and stable bindings for Python. This makes it easier to develop image processing programs without leaving the comfort of Python or turning to alternative solutions like SimpleCV. For the past week, I have been tinkering with OpenCV in quite some detail and it has been super fun. In this post, I will show how to run a simple face detection script using Python.

The prerequisites are OpenCV (I use 2.4.9 with the Python bindings) and Python 2.7+.

The latest OpenCV can be easily installed on a Mac using:

brew install opencv

For complete Mac installation instructions, follow this link; for Ubuntu, follow this link. Homebrew will install all the dependencies on a Mac.

We will run face detection using Haar feature-based cascade classifiers. The OpenCV installation already comes with pre-trained classifiers for many facial keypoints such as the eyes, nose, mouth and the face itself. We will use the file haarcascade_frontalface_alt.xml, which is available in the OpenCV installation directory at /usr/local/Cellar/opencv/2.4.9/share/OpenCV/haarcascades. The path may vary based on the installation, but it should be inside the directory where OpenCV was installed. The XML file needs to be copied into the local directory containing the Python face detection script.

Following is the code:

import cv2

# Load the input image and the pre-trained frontal face classifier.
img = cv2.imread("input_pic.jpg")
cascade = cv2.CascadeClassifier("haarcascade_frontalface_alt.xml")

# scaleFactor=1.3, minNeighbors=4, minSize=(20, 20)
rects = cascade.detectMultiScale(img, 1.3, 4, cv2.cv.CV_HAAR_SCALE_IMAGE, (20, 20))

if len(rects) != 0:
    # detectMultiScale returns (x, y, w, h); convert to corner coordinates.
    rects[:, 2:] += rects[:, :2]
    for x1, y1, x2, y2 in rects:
        cv2.rectangle(img, (x1, y1), (x2, y2), (127, 255, 0), 2)
    cv2.imshow('Face Detection', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
else:
    print "Could not detect face from the image."

Let us see how this works for us. I have used the following image on the left, a cast photo from The Avengers movie. On the right is the image with the face detected.

Well, that was easy!

Atom by Github

| Comments

Github is bringing out a new text editor named 'Atom'. Though it looks somewhat like Sublime Text, it is hoped to be good. It also has a Python package. The project is still invite-only and in beta. Check here for more.

Minimal 2048

| Comments

I decided to put a little minimalist touch on the viral game 2048. I forked the repo on Github and removed the text from the tiles, which means the tiles no longer show you the numbers. What you ultimately have is a bunch of tiles of different colors. I must say it looks better, and since the colors are quite distinct from one another, it feels good to play. Check it out here.

Buddhism and Modern Psychology

| Comments

Coursera is offering a free online course on Buddhism and Modern Psychology by Robert Wright of Princeton University. I found it through r/meditation on Reddit. It is going to be a perfect course for me, given its topics and syllabus. This is also what sets Coursera apart from the rest of the MOOC platforms. I previously took "Introduction to Philosophy" offered by the University of Edinburgh (hope I spelled that correctly!), and that course turned out to be surprisingly good. Hoping this one will be as good, if not better. The course can be accessed from here.

Thank you Coursera!

Vincent

| Comments

Vincent is a Python-to-Vega translator which helps build easy visualizations on top of d3.js with Vega. Now what is Vega? Vega is a visualization grammar for creating and declaring visualization designs.

Vincent converts native Python data structures and collections into the Vega visualization grammar, and also supports Pandas and IPython, including the IPython Notebook! Vincent is available in the Python package repository and can be easily installed using pip install vincent. Using Vincent is dead easy, and it supports many of the visualization templates available for d3.js. The next plan is to use Vincent for our visualization work at DataWeave :)

The source code of Vincent is free and available on Github here.

For example, the following creates a simple bar chart with available data:

import vincent

multi_iter1 = {'y1': [10, 20, 30, 20, 15]}  # sample data so the snippet is self-contained
bar = vincent.Bar(multi_iter1['y1'])
bar.axis_titles(x='Index', y='Value')
bar.to_json('vega.json')

Pomodoro

| Comments

Pomodoro is an Italian word meaning 'tomato'! It is also the name of an effective and popular time management technique which uses a timer to break work into short intervals, usually 25 minutes each. The intervals are known as pomodori. There are five steps in total (taken from Wikipedia):

  1. Decide what you want to do
  2. Set the pomodoro timer to 25 mins
  3. Keep working until the timer goes off
  4. Take a short break for 5 mins
  5. Every four pomodori, take a longer break of 15 mins
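The steps above can be sketched as a tiny timer script (a toy illustration of my own, not part of any library):

```python
import time

def pomodoro_schedule(work=25 * 60, short_break=5 * 60,
                      long_break=15 * 60, cycles=4):
    """Build a list of (phase, seconds) pairs for one run of four pomodori."""
    schedule = []
    for i in range(1, cycles + 1):
        schedule.append(('work', work))
        # Take the longer break after every fourth pomodoro.
        schedule.append(('break', long_break if i == cycles else short_break))
    return schedule

def run(schedule):
    for phase, seconds in schedule:
        print('%s for %d minutes' % (phase, seconds // 60))
        time.sleep(seconds)
```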

Hope you find this useful!

Pickling in Python

| Comments

Python provides a pickle module for serializing and de-serializing Python objects or collections. There are other names for pickling, such as flattening or marshalling. The process of converting a Python object into a byte stream is called pickling, and the reverse process of getting the object back is known as unpickling. Python also provides cPickle, which is written in C and can be up to 1000 times faster than the standard pickle module.

A short example of pickling, taken from the official Python documentation:

import pickle

data1 = {'a': [1, 2.0, 3, 4+6j],
         'b': ('string', u'Unicode string'),
         'c': None}

selfref_list = [1, 2, 3]
selfref_list.append(selfref_list)

output = open('data.pkl', 'wb')

# Pickle dictionary using protocol 0.
pickle.dump(data1, output)

# Pickle the list using the highest protocol available.
pickle.dump(selfref_list, output, -1)

output.close()

However, I prefer to encode my collections and objects as JSON using the simplejson module. It is neat and simple, and one added advantage is that you can actually see your data in a readable format. JSON serialization and deserialization are quite fast, and I have never faced any problems with it except the occasional unicode error. At DataWeave, a single JSON file can reach up to a gigabyte and can be quite memory-intensive if you load the entire thing into RAM. Still, JSON can be a very good replacement for the pickle module.
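A minimal sketch of the JSON route, using the standard library's json module (simplejson exposes the same dumps/loads interface). Note that, unlike pickle, JSON cannot represent complex numbers or self-referencing lists, so the data here sticks to JSON-friendly types:

```python
import json

data = {'a': [1, 2.0, 3], 'b': 'string', 'c': None}

encoded = json.dumps(data)    # a plain, human-readable string
decoded = json.loads(encoded)
assert decoded == data
```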

In the Spark project, we use the cloudpickle module, which improves on the standard pickle module by adding more features. There have been some proposals to use JSON for serialization, but nothing has been finalized yet.

Run PySpark With Data Stored in HDFS

| Comments

In the previous post, I showed how to run a PySpark job where the data is stored in the local file system. To access files stored in HDFS, we only need to specify the absolute path of the file in HDFS; the rest works the same.

from pyspark import SparkContext
sc = SparkContext("spark://master:7077", "Test")
data = sc.textFile("hdfs://master/path/to/file")