Jyotiska NK

Technology, Philosophy and Minimalism

Why MongoDB Fits the Bill for Building Our Assortment API

To date we have tried out plenty of databases, some relational and some document or column stores: MySQL, Cassandra, Redis, MongoDB. But our Assortment API poses an interesting challenge that has to be handled carefully. Before going into the details of how we handled it, let's shed some light on what assortment means. Each website has its own set of products in different categories. Apart from category, there are a number of important attributes for a product: subcategory, product type, price, discount and so on. Each source tries its best to showcase its products competitively with respect to other websites. Our assortment API provides intelligence on the assortment of products for each of these sources. For example, a customer would like to know how many products each source has in the jeans category, within the price range 1000 to 2000, broken down by brand.

The challenge is to build an API flexible enough to sustain all the query requirements. This needs a database with a very expressive query language. The queries vary from category to category: one customer might like to view apparel assortments by size, fit and price range, while another might want to view laptop assortments by RAM, processor speed and price range. It would be unwise to have one assortment API per category. The challenge from an engineering perspective is to accommodate every request in one solid API, flexible under the hood, that can serve any request at scale.

We have found MongoDB to do just that. Its query language is expressive, its aggregation framework is robust and the Python driver serves all our needs. Since we have pushed the aggregation framework to its limits, I will spend a few more sentences praising it. Thanks to its pipeline, you can push your $group, $sort, $project, $unwind, $skip, $limit and more into one single query and still get your data back in real time. Currently the DB has a few hundred thousand products in each collection, yet it delivers results with sub-second latency. This happens on a single Mongo server! Things are bound to get interesting later on when we spread the data across multiple nodes and shards.
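To make that concrete, here is a minimal sketch of the kind of pipeline such an API can build with PyMongo; the database, collection and field names (assortment, products, category, price, source, brand) are hypothetical, not from the original post.

import pymongo  # assumes the PyMongo driver is installed

client = pymongo.MongoClient()
products = client["assortment"]["products"]

pipeline = [
    # Keep only jeans in the 1000-2000 price range.
    {"$match": {"category": "jeans", "price": {"$gte": 1000, "$lte": 2000}}},
    # Count products per source and brand.
    {"$group": {"_id": {"source": "$source", "brand": "$brand"},
                "count": {"$sum": 1}}},
    # Largest assortments first.
    {"$sort": {"count": -1}},
]

for row in products.aggregate(pipeline):
    print(row["_id"], row["count"])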

Things are looking promising at this point, and we believe we should be able to scale this assortment API over time, from serving a few million products to hundreds of millions in the future.

JSON vs simpleJSON vs ultraJSON

It's been a while since I posted on my blog, and nothing kickstarts blogging better than a post about benchmarks. Without argument, the most used file format at DataWeave is JSON. We use it almost everywhere: product lists, crawl dumps, assortments and more. So it is very important to make sure that the libraries and packages we use are fast enough to handle large datasets. There are two popular packages for handling JSON: the stock 'json' package that comes with the default installation of Python, and simplejson, an optimized and actively maintained JSON package for Python. The goal of this blog post is to introduce ultrajson, or Ultra JSON, a JSON library written mostly in C and built to be extremely fast.

We have benchmarked three popular operations: 'load', 'loads' and 'dumps'. We have a dictionary with 3 keys: 'id', 'name' and 'address'. We dump this dictionary using json.dumps() a number of times and store the output in a file. Then we use json.loads() and json.load() separately to load the dictionaries back from the file. We performed this experiment on 10000, 50000, 100000, 200000, 500000 and 1000000 dictionaries and observed how much time each library takes to perform each operation.
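For reference, the harness looks roughly like the sketch below; the exact script is not in the original post, so the record contents, file name and timing details are assumptions.

import time
import json  # swap in simplejson or ujson for the other runs

record = {"id": 1, "name": "John Doe", "address": "42 Example Street"}

def bench_dumps_line_by_line(n, path="dump.txt"):
    # Dump the same dictionary n times, one JSON document per line,
    # and return the elapsed time in milliseconds.
    with open(path, "w") as f:
        start = time.time()
        for _ in range(n):
            f.write(json.dumps(record))
            f.write("\n")
        return (time.time() - start) * 1000

print(bench_dumps_line_by_line(10000))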

dumps operation line by line

Here are the results for the json.dumps() operation, dumping the content dictionary by dictionary.

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              84.106          115.876               25.891
50000              395.163         579.576               122.626
100000             820.280         1147.962              246.721
200000             1620.239        2277.786              487.402
500000             3998.736        5682.641              1218.653
1000000            7847.999        11501.038             2530.791

We notice that json performs better than simplejson, but ultrajson wins the game, running more than 3x faster than stock json (and about 4.5x faster than simplejson).

dumps operation (all dictionaries at once)

In this experiment, we have stored all the dictionaries in a list and dumped the list using json.dumps().

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              21.674          21.941                14.484
50000              102.739         118.851               67.215
100000             199.454         240.849               138.830
200000             401.376         476.392               270.667
500000             1210.664        1376.511              834.013
1000000            2729.538        2983.563              1830.707

simplejson is almost as good as stock json here, but again ultrajson outperforms both, finishing about 1.5x faster than stock json. Now let's see how they perform for the load and loads operations.

load operation on a list of dictionaries

Now we do the load operation on a list of dictionaries and compare the results.

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              47.040          8.932                 10.706
50000              165.877         44.065                45.629
100000             356.405         97.277                105.948
200000             718.873         185.917               205.120
500000             1699.623        461.605               503.203
1000000            3441.075        949.905               1055.966

Surprisingly, simplejson beats the other two, with ultrajson a close second. Here simplejson is about 3.6x faster than stock json, and ultrajson about 3.3x faster.

loads operation on dictionaries

In this experiment, we load dictionaries from the file one by one and pass them to the json.loads() function.

Number of dicts    json (in ms)    simplejson (in ms)    ujson (in ms)
10000              99.501          69.517                15.917
50000              407.873         246.794               76.369
100000             893.989         526.257               157.272
200000             1746.961        1025.824              318.306
500000             8446.053        2497.081              791.374
1000000            8281.018        4939.498              1642.416

Again ultrajson steals the show, being about 5x faster than stock json and 3x faster than simplejson.

That's all the benchmarks we have here, and the verdict is pretty clear. Use simplejson instead of stock json in all cases, since simplejson is a well-maintained package. If you really want something extremely fast, go for ultrajson. In that case, keep in mind that ultrajson only works with well-defined collections and will not work for unserializable objects. But if you are dealing with text, this should not be a problem.
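Since all three libraries expose the same dumps/loads interface for these calls, switching is usually a one-line change; a small sketch (not from the benchmark script):

try:
    import ujson as json  # fastest, but stricter about what it serializes
except ImportError:
    import json  # stdlib fallback

data = json.loads(json.dumps({"id": 1, "name": "John Doe"}))
print(data)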

Gourmet

I was curious for quite a while about what gourmet actually means. So I did a little digging today and found out that it means a connoisseur of fine food and drink. Many high-end restaurants put "fine gourmet dining" in their ads. It also implies that the food was elaborately made and is expected to be quite rich in taste, and it can describe a type of cuisine, restaurant or meal. Besides the elaborate preparation process, the food itself is guaranteed to be expensive.

Mutual Exclusion

Mutual exclusion makes sure that two or more concurrent processes do not access a particular resource or critical section at the same time. Go has built-in support for mutexes in its sync package. The following example gives an idea of how mutexes are used in Go:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var mutex = &sync.Mutex{}
    var wg sync.WaitGroup
    total := 0
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(i int) { // pass i by value so each goroutine adds its own copy
            defer wg.Done()
            mutex.Lock() // only one goroutine may update total at a time
            total += i
            mutex.Unlock()
        }(i)
    }
    wg.Wait() // without this, main would exit before the goroutines run
    fmt.Println(total)
}

This is a small program where total is a shared variable and forms the critical section. We define a mutex and create 100 goroutines, each of which tries to add an integer to the total variable. We use the mutex's 'Lock' method to make the operation atomic; once a goroutine has successfully added its value to total, it releases the mutex using the 'Unlock' method so that other goroutines can access the variable. (The WaitGroup simply keeps main alive until every goroutine has finished.) This may not be the most realistic use case for mutual exclusion, but the same concept applies to other critical operations such as updating a database, changing a value at a particular memory location, or using a resource for computation.

Deadlock can occur when two goroutines are waiting for each other and neither is able to make progress. Deadlock is a classic problem in the operating systems world, and as a programmer it is advisable to take extra care so that no deadlock condition can arise in your code. The Go runtime can detect when every goroutine is blocked, and the error message will contain something like this: all goroutines are asleep - deadlock!. Let us create a small deadlock condition:

package main

import (
    "fmt"
    "time"
)

func Publish(text string, delay time.Duration) (wait <-chan struct{}) {
    ch := make(chan struct{})
    go func() {
        time.Sleep(delay)
        fmt.Println(text)
        // Bug: ch is never closed, so receivers block forever.
    }()
    return ch
}

func main() {
    wait := Publish("Sample message", 5*time.Second)
    fmt.Println("Waiting...")
    <-wait // blocks forever, since Publish never closes the channel
    fmt.Println("Exiting")
}

This code calls the Publish method with a string and a delay of 5 seconds. Inside Publish, a goroutine is created that prints the text but never closes the channel. In the main function, the main goroutine waits for that channel to close before finishing its execution, which never happens. This leads to a deadlock.

A data race takes place when two goroutines race against each other to access some data, such as a variable. It is not easy to predict, while writing the code, when a data race will happen. As with deadlocks, Go is able to detect data races at runtime when the program is run with the race detector (the -race flag). The following function shows where a data race can take place:

// DataRace lets two goroutines write to n with no synchronization.
func DataRace() {
    wait := make(chan struct{})
    n := 0
    go func() {
        n++ // write from the new goroutine
        close(wait)
    }()
    n++ // concurrent write from the caller: this is the race
    <-wait
    fmt.Println(n)
}

It is very hard to predict which routine will get to the variable n first. The goroutine tries to increment n, then the main function increments it, and then the value is printed. But if the goroutine is slow, the main function will get hold of the variable and increment it first. No damage is done in this example, because we wait for the goroutine to finish its execution. But when multiple goroutines are involved and they depend on the value of the variable to finish their own execution, it gets tricky to guess what data they might see. One solution is to use a channel and make the value of the variable visible to only a single goroutine at a time. That way the value leaves one goroutine and goes to another, maintaining the sequence. I plan to investigate the concurrency patterns of Go in more detail and will keep posting as I do. Thanks Stefan for the great writeup!
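As a rough sketch of that channel idea (my own illustration, not from Stefan's writeup), the counter's current value can live in a channel, so only the goroutine that has received it may modify it at any moment:

package main

import "fmt"

func main() {
    counter := make(chan int, 1) // holds the current value of n
    counter <- 0
    done := make(chan struct{})

    go func() {
        n := <-counter // take ownership of the value
        n++
        counter <- n // hand it back
        close(done)
    }()

    n := <-counter // main takes its own turn
    n++
    counter <- n
    <-done

    fmt.Println(<-counter) // always 2: the two writes are sequenced
}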

100 Things Challenge

The 100 things challenge is a popular challenge among minimalists. The aim is to cut down your personal belongings to 100 things or fewer. Leo Babauta sums it up in his blog. Things not included in the list are non-personal items such as cleaning and cooking equipment, books, tools and toiletries. Cutting down your possessions is quite difficult but not impossible: many minimalists have done it and posted about it on their blogs. When you start the challenge, first count how many possessions you actually have, and then take a mental note of how many items you can cut. The end result is a decluttered home, mental freedom and a boost in self-confidence. This doesn't mean that you have to throw away everything; that actually defeats the purpose. Just get rid of the things you have never used in the past couple of months, and you will find plenty of things like that.

I am starting this challenge this month. Of course, it will be difficult for me because I have accumulated so many things in the past 2 years, including collectibles, gifts, books, stationery, clothes and much more. But this is a slow and gradual process. I will post again in the first week of June to describe how the challenge has gone!

Porting Python Packages to Go - Lessons Learned

This week I ported the webcolors library from Python to Go. Webcolors is a popular library with more than 6000 downloads in the last month, and we needed it for our work at DataWeave. So I decided, why not, let's give it a shot; it would also help me brush up my Go skills. But the journey was not as easy as I expected it to be. Here are a few lessons learned over this project:

  • Go is not as smooth as Python at duck typing. Sometimes I had to write dedicated functions to convert from one type to another.
  • Go doesn't forgive you if you have even one unused variable in your code. This can be a big PITA when you are experimenting with different things in your code or have multiple unused imports.
  • You have to handle multiple return values and errors everywhere in your code. This ultimately turned out to be a good thing.
  • There is no list comprehension or dict comprehension, which makes code golfing very hard. Also you can't just jump from type to type. (No, I did not use interfaces.)
  • Writing tests is particularly easy. You just need a file with a “_test” suffix, like “myprog_test.go”, and you can put all your tests there (see the sketch after this list).
  • Packaging is relatively simple. You can use your entire GitHub repo as a package and files will be imported from there.
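For anyone new to Go's testing convention, here is a minimal sketch of such a file; the package name, function under test and expected value are hypothetical:

package webcolors

import "testing"

// This would live in webcolors_test.go next to the code under test;
// `go test` picks it up automatically.
func TestHexToName(t *testing.T) {
    got := HexToName("#ffffff") // hypothetical function under test
    if got != "white" {
        t.Errorf("HexToName(#ffffff) = %q, want white", got)
    }
}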

The package is available here. Those are a few of the lessons off the top of my head. Next week I am going to port the 'colorific' library, which will be pretty difficult as Go's image library is not as sophisticated as Python's PIL. Let's see how it turns out. Fingers crossed!

Type Systems in Go

I again had a chance to use Go for a small project. I am porting a popular library from Python to Go, and it is proving more challenging than I originally anticipated. The reason is that Go's type system is more restrictive than Python's. In this project I am juggling a lot of ints, bytes, hex values, strings, slices and maps, which is quite difficult. For example, Go has two families of integers: signed integers represented as int, int32 or int64, and unsigned integers represented as uint, uint32 or uint64. Upcasting and downcasting work as usual, but the problems come when converting between different data types.

For example, say you have the string num = “5” and you would like to convert it into an integer. In Python this is as simple as int(num), and you get back the correct result. Not so in Go. You need the strconv package, which converts your string into an integer using strconv.Atoi(num), just like C. Moreover, this returns two results: the first is the integer value and the second is an error. If you assign the error to a variable and then never use it, the compiler gives you an “unused variable” error. So you have to properly handle the error before continuing with your function. This is all good, and I understand that it aims to make code more robust, but for someone coming from a Python environment it resulted in plenty of facepalms, and loads of curse words were muttered.
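For illustration, the conversion with its error handled looks like this:

package main

import (
    "fmt"
    "strconv"
)

func main() {
    num := "5"
    n, err := strconv.Atoi(num) // returns both the value and an error
    if err != nil {
        fmt.Println("not a number:", err)
        return
    }
    fmt.Println(n + 1) // prints 6
}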

Another problem I had was converting a byte stream to an integer. I used the following nifty function I found on SO, which does the job by shifting bytes:

// byte_to_int reads the input as a big-endian unsigned integer,
// shifting each byte into its place in the result.
func byte_to_int(input []byte) int {
    var output uint32
    l := len(input)
    for i, b := range input {
        shift := uint32((l - i - 1) * 8) // most significant byte first
        output |= uint32(b) << shift
    }
    return int(output)
}

Online Course on Distributed Systems

I am planning to start this course on Distributed Systems. It is taught by the famous Robert Morris, at least in most of the earlier offerings. Since the lecture videos are not available, I will make do with the suggested paper readings and lab exercises. There are four programming lab assignments in total, and the last one is to build a distributed, sharded key/value store, much like the idea of msdb I have in mind. The programming assignments are in Go, which delighted me most. Although there are no certificates for finishing the course, I must say I am not even looking for one. When you do a course just for the sake of a completion certificate, it spoils the purpose. Let it be a good learning experience in the things I love learning and experimenting with.

NoSQL Db in 40 Lines of Code

I kept experimenting with msdb, my own flavor of a lightweight, pure-Python, embedded document-storage NoSQL database. I am still proud of the first version of the code I wrote. It was ~40 lines, and it could insert data, open a database, generate unique ids, search by id and search by key. It was of course not the fastest compared to popular DBs like Mongo, but it was decent. I remember inserting 1 million documents in 36 seconds without ids, and in 13 seconds with ids already present.

There are plenty of improvements to make: a decent query language, concurrency, multithreading, good indexing, and the list goes on. I am planning to do all of them slowly and release the first version after I have implemented hashing and indexing. But I thought I would share msdb 0.0.1, the most basic version. Following is the code:

import uuid
import json

class Database:
    def __init__(self):
        self.path = ""
        self.document = {}   # in-memory index: id -> document
        self.dbfile = None

    def generateID(self):
        # Random 32-character hex id for documents that lack one.
        return str(uuid.uuid4()).replace('-', '')

    def create(self, dbpath):
        self.path = dbpath
        self.dbfile = open(self.path, 'w')

    def open(self):
        # insert() writes one JSON document per line, so read the
        # file back line by line and rebuild the in-memory index.
        self.dbfile = open(self.path)
        for line in self.dbfile:
            doc = json.loads(line)
            self.document[str(doc["id"])] = doc

    def insert(self, userdata):
        if "id" not in userdata:
            userdata["id"] = self.generateID()
        # Append to the file and keep the in-memory index in sync.
        self.dbfile.write(json.dumps(userdata))
        self.dbfile.write("\n")
        self.document[str(userdata["id"])] = userdata

    def findByKey(self, key):
        # Look up a single document by its id.
        if str(key) in self.document:
            return self.document[str(key)]
        return None

    def find(self, key, value):
        # Full scan: every document whose `key` field equals `value`.
        finaldocument = []
        for each_doc in self.document.values():
            if str(key) in each_doc and each_doc[str(key)] == value:
                finaldocument.append(each_doc)
        return finaldocument
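
A quick usage sketch (my own, not part of the original post):

db = Database()
db.create("test.db")
db.insert({"name": "John", "city": "Bangalore"})
db.insert({"name": "Jane", "city": "Delhi"})
print(db.find("city", "Delhi"))   # every document with city == "Delhi"
doc_id = next(iter(db.document))
print(db.findByKey(doc_id))       # primary-key lookup by generated id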