Creating word frequency lists is an easy task in most programming languages, but how easy is it exactly? And what are the performance trade-offs? We played around with our favorite programming languages and got surprising results. The experiment is still going; you can participate too. And please do.
A couple of weeks ago I had a homework assignment where I had to create a word frequency list from text read from the standard input, and output the words and frequencies in decreasing order of frequency. Ties had to be printed in alphabetical order. This seems like a fairly easy task - I do it all the time in Python - so I was curious how easy it would be in C++.
My first version looked like this [wordcount_map.cpp]:
The algorithm is rather straightforward:
- Define a counter object; it has to be addressable with arbitrary strings.
- Count words.
- Sort the counter by frequency.
- Print word-frequency pairs.
Now let’s see how it works.
$ echo "alma körte szilva alma körte" | ./cpp/wc_baseline
alma 2
körte 2
szilva 1
It does a good job counting the occurrences of these five tokens (Hungarian fruit names).
How does it work on a bigger dataset? We will get back to it later.
Since I do similar tasks in Python almost every day, I wrote the shortest Python implementation I could come up with. A slightly neater version is available here: wordcount.py
Steps 3 and 4 can be grouped together into a single sorted iteration over the dictionary elements.
Let’s test both scripts on some real data. I downloaded the Hungarian Wikisource XML database dump for testing and did no further preprocessing. Although the whole file is 2,000,000 lines, I used the first 500,000 for testing.
$ g++ cpp/wordcount_map.cpp -std=c++11 -o cpp/wc_baseline -O3
$ time head -500000 data/huwikisource-latest-pages-meta-current.xml | ./cpp/wc_baseline > /tmp/cpp_out

real    0m6.308s
user    0m5.920s
sys     0m0.480s
Now let’s compare it with Python.
$ time head -500000 data/huwikisource-latest-pages-meta-current.xml | ./python/wordcount_py2.py > /tmp/python2_out

real    0m5.647s
user    0m5.477s
sys     0m0.220s
OK, that was unexpected. Python is supposed to be much slower than C++, right? Before we delve into the details, let's look at the output.
The output is nothing surprising. The most frequent words are Hungarian function words, articles (a, az, és) on the top. We also get some MediaWiki markup mixed in such as =.
$ head /tmp/cpp_out
a 126995
| 45751
az 42187
és 37088
= 22811
nem 22529
hogy 20949
A 19994
s 19042
- 17857
The bottom of the list looks like this:
$ tail /tmp/cpp_out
와일드]] 1
작품]]</comment> 1
저자]]</comment> 1
태도]] 1
프롤레타리아]] 1
�</text> 1
�Bon 1
�Good 1
�Guten 1
�békés 1
(It’s clear from these few lines alone that word tokenization is far from trivial, and for real-life scenarios we need more sophisticated solutions than this.)
There are actually 421,738 hapaxes (words that appear only once) out of 2,758,420 total words (573,090 types, i.e. unique words). Since we require a secondary alphabetical sort in the case of frequency ties, non-Hungarian words and invalid characters are sorted last.
Why is the C++ implementation slower?
Examining the data structures tells us that std::map is actually a red-black tree, in contrast to Python’s dict, which is a hash table. Both have their advantages, but for this task insertion is particularly interesting, since we insert almost 600 thousand distinct words. Insertion into a red-black tree is O(log n), while it is O(1) on average in the case of a hash table. There are just too many word types.
Fortunately the STL has several hash-table-based containers, such as std::unordered_map; we just have to replace std::map with std::unordered_map.
$ g++ cpp/wordcount_hashtable.cpp -std=c++11 -o cpp/wc_hash -O3
$ time head -500000 data/huwikisource-latest-pages-meta-current.xml | ./cpp/wc_hash > /tmp/cpp2_out

real    0m4.970s
user    0m4.613s
sys     0m0.520s

$ diff /tmp/cpp_out /tmp/cpp2_out | wc -l  # making sure the outputs are the same
0
Python is finally beaten, but not by much. And it turns out that with an older g++ (mine is 5.2 on Manjaro Linux) this is not even true, but let’s get back to that later.
Can we speed up C++ a bit more?
Standard C++ streams are synchronized to the standard C streams after each input/output operation. This behavior can be disabled by setting sync_with_stdio to false.
Now let’s try this version (see the full code here).
$ g++ cpp/wordcount_hashtable_nosync_stdio.cpp -o ./cpp/wc_hash_nosync -std=c++11 -O3
$ time head -500000 data/huwikisource-latest-pages-meta-current.xml | ./cpp/wc_hash_nosync > /tmp/cpp_out

real    0m3.647s
user    0m3.273s
sys     0m0.480s
Still, a few things can be improved, such as using const references instead of copying, and inserting the (word, frequency) pairs into a vector and then sorting that vector. The current best version is here.
$ g++ cpp/wordcount_vector.cpp -o ./cpp/wc_vector -std=c++11 -O3
$ time head -500000 data/huwikisource-latest-pages-meta-current.xml | ./cpp/wc_vector > /tmp/cpp3_out

real    0m2.644s
user    0m2.340s
sys     0m0.403s
The current leaderboard is here.
The full test was run in a Docker container on Ubuntu with GCC 4.8, and - to my surprise - the running times were quite different there too.
Full results for 500,000 lines are summarized in this table:
| Rank | Experiment | CPU seconds | User time | Memory |
And for the full huwikisource dump:
| Rank | Experiment | CPU seconds | User time | Maximum memory |
* The shell script’s memory consumption cannot be measured properly with time; suggestions are welcome.
I’ve never used Docker before and I decided to give it a try for this experiment. Docker is similar to a virtual machine but much more lightweight. You can download images or build your own and load them into Docker containers which are mini operating systems with root access for you. Images are built by running commands read from a Dockerfile. I’m not going to bore you with all the technical details, you can find an explanation in this README.
I created a Dockerfile which installs the dependencies needed by this experiment. My friends added other languages and their dependencies too, so once you build the image and load it into a container, the experiment should work without installing additional packages.
You can reproduce the whole experiment by running a few commands inside your container.
Can you do better?
Is your favorite language missing? Or can you do better in language X? Please do.
You’re welcome to improve any of the existing scripts or provide a new one. Please read the wordcount repository’s README for instructions on how to install and add a new script. If your code successfully installs and compiles with the provided Docker image and it passes the tests, run a few experiments. If you have reasonably good results - not necessarily a new winner - please create a pull request and I’ll add it to the main repository.
The point of this experiment was to compare simple solutions in different programming languages to a simple problem. It turns out the results are not what I expected, but by no means am I trying to tell which language is better than the other. I also do not claim that these are the best possible solutions.