Counting words in different programming languages II.
I started a word counting challenge a few months ago and it received a lot more interest than I had expected:
- 18 unique solutions in 13 languages (if you count Python2 and Python3 as separate languages),
- from over 20 contributors,
- totaling 734 lines of code,
- via 17 accepted pull requests.
Here is the repository.
Organizing this challenge has been an amazing experience for me, and I learnt a few valuable lessons:
- I should have made the rules much clearer at the beginning instead of clarifying them later.
- I should have decided on the test data sooner to avoid having multiple leaderboard tables.
- I ended up adding half of the test cases after announcing the competition, to cover cases I had not anticipated at first. The end results are still not exactly the same because the solutions use different Unicode sorting algorithms.
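To illustrate how Unicode sort order can differ between implementations, here is a minimal Python sketch. The sample words and the three orderings are my own assumptions chosen for illustration; they are not taken from the test data, and the submitted solutions may differ for other reasons.

```python
import unicodedata

# Hypothetical sample: three words, a fullwidth tilde (U+FF5E) and an emoji (U+1F600).
words = ["zebra", "\u00e1bra", "alma", "\U0001F600", "\uFF5E"]

# Order 1: by Unicode code point (Python's default string comparison).
by_code_point = sorted(words)

# Order 2: by UTF-16 code units, the default string comparison in languages
# with UTF-16 strings such as Java or JavaScript. Characters outside the
# Basic Multilingual Plane become surrogate pairs (0xD800-0xDFFF), so they
# sort before BMP characters above that range.
by_utf16 = sorted(words, key=lambda w: w.encode("utf-16-be"))

# Order 3: by canonically decomposed form (NFD), so that an accented
# letter compares as its base letter followed by a combining mark.
by_nfd = sorted(words, key=lambda w: unicodedata.normalize("NFD", w))

# The three orderings disagree on the same input.
assert by_code_point != by_utf16
assert by_code_point != by_nfd
```

With code point order, "ábra" sorts after "zebra"; with the NFD key it sorts between "alma" and "zebra". The emoji and the fullwidth tilde swap places between code point order and UTF-16 order.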
Results
I used the Hungarian Wikipedia for the final experiments. Each solution was tested on the first 5 million lines and later on the full Hungarian Wikipedia, which is 65 million lines (3.6GB) and consists of 28,284,983 unique words.
Besides the best solutions for each language, my original C++ and Python2 implementations are also listed.
Results on 5 million lines
Each solution was tested 3 times and the fastest run is listed.
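A best-of-three measurement of this kind can be sketched in Python as follows. This is not the script used for the tables below: the names `Run`, `measure_once` and `best_of` are hypothetical, choosing the fastest run by wall-clock time is an assumption, and the sketch records only time, not memory. It relies on the Unix-only `resource` module.

```python
import resource
import subprocess
import sys
import time
from typing import NamedTuple, Sequence


class Run(NamedTuple):
    wall_seconds: float
    user_seconds: float
    system_seconds: float


def measure_once(cmd: Sequence[str]) -> Run:
    """Run cmd once and return its wall-clock, user and system time."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # RUSAGE_CHILDREN accumulates over all finished children,
    # so the difference belongs to the command that just ran.
    return Run(
        wall,
        after.ru_utime - before.ru_utime,
        after.ru_stime - before.ru_stime,
    )


def best_of(cmd: Sequence[str], runs: int = 3) -> Run:
    """Repeat the measurement and keep the run with the lowest wall time."""
    return min((measure_once(cmd) for _ in range(runs)),
               key=lambda run: run.wall_seconds)


if __name__ == "__main__":
    # Placeholder command; a real experiment would run a word counter instead.
    print(best_of([sys.executable, "-c", "sum(range(1_000_000))"]))
```

Keeping the fastest of several runs reduces the influence of one-off disturbances such as a cold disk cache or other processes competing for the CPU.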
Rank | Experiment | CPU seconds | User time | Maximum memory |
---|---|---|---|---|
1 | cpp/wc_vector | 35.61 | 34.33 | 772244 |
2 | python/wordcount_py2gabor.py | 36.58 | 35.17 | 597112 |
3 | go/bin/wordcount | 37.92 | 36.45 | 855768 |
4 | php7.0 php/wordcount.php | 54.31 | 39.66 | 709476 |
5 | python/wordcount_py2.py | 60.34 | 58.18 | 1432904 |
6 | python/wordcount_py3.py | 93.57 | 90.93 | 1241448 |
7 | mono csharp/WordCountList.exe | 96.02 | 66.34 | 898000 |
8 | perl/wordcount.pl | 103.66 | 101.03 | 1237780 |
9 | php5.6 php/wordcount.php | 124.38 | 106.93 | 2119420 |
10 | java -classpath java WordCount | 145.07 | 130.04 | 1816224 |
11 | julia julia/wordcount.jl | 152.75 | 148.43 | 2568724 |
12 | bash/wordcount.sh | 273.16 | 288.23 | 13616 |
13 | haskell/WordCount | 284.39 | 275.77 | 4208052 |
14 | cpp/wc_baseline | 359.12 | 343.69 | 979528 |
15 | nodejs javascript/wordcount.js | 565.26 | 563.83 | 977348 |
Results on the full Hungarian Wikipedia
A few solutions did not run successfully on the full dataset:
- all Java versions seemed to hang for hours, and I killed them after a day,
- the NodeJS and Haskell versions ran out of memory (16GB).
The PHP version was tested with both PHP 5.6 and PHP 7.0, and the latter performed significantly better: 464.55 CPU seconds against 1121.57 on the full dataset, roughly 2.4 times faster. The final results are:
Rank | Experiment | CPU seconds | User time | Maximum memory |
---|---|---|---|---|
1 | cpp/wc_vector | 267.9 | 257.14 | 4126276 |
2 | python/wordcount_py2gabor.py | 333.34 | 321.56 | 3844908 |
3 | go/bin/wordcount | 349.26 | 332.43 | 6066928 |
4 | php7.0 php/wordcount.php | 464.55 | 377.82 | 4039392 |
5 | python/wordcount_py2.py | 545.38 | 529.71 | 8670208 |
6 | mono csharp/WordCountList.exe | 796.2 | 637.8 | 4780360 |
7 | perl/wordcount.pl | 881.23 | 861.38 | 6979772 |
8 | python/wordcount_py3.py | 909.86 | 888.65 | 7561112 |
9 | php5.6 php/wordcount.php | 1121.57 | 1001.84 | 12468856 |
10 | julia julia/wordcount.jl | 1798.55 | 1763.0 | 7284708 |
11 | bash/wordcount.sh | 2100.96 | 2128.94 | 13768 |
There were so many contributors that I can't list them all here; please check the repository.
Can I still contribute?
Yes, you are welcome to contribute. Solutions in other languages are especially welcome. However, I won't update this article with your results. If a second wave of great solutions comes, I might write a third article.
See the contribution guidelines in the previous article and in the repository's README.