I started a word counting challenge a few months ago and it received a lot more interest than I had expected:

  • 18 unique solutions in 13 languages (if you count Python2 and 3 separate languages),
  • from over 20 contributors,
  • totaling in 734 lines of code,
  • via 17 accepted pull requests.

Here is the repository.

Organizing this challenge has been an amazing experience for me, and I learnt a few valuable lessons:

  • I should have made the rules much clearer at the beginning instead of clarifying them later.
  • I should have decided on the test data sooner to avoid having multiple leaderboard tables.
  • I ended up adding half of the test cases later than announcing the competition to cover cases I had not anticipated at first. The end results are still not exactly the same due to different unicode sorting algorithms.

Results

I used the Hungarian Wikipedia for the final experiments. Each solution was tested on the first 5 million lines and later on the full Hungarian Wikipedia, which is 65 million lines (3.6GB) and consists of 28,284,983 unique words.

Besides the best solutions for each language, my original C++ and Python2 implementations are also listed.

Results on 5 million lines

Each solution was tested 3 times and the fastest run is listed.

Rank Experiment CPU seconds User time Maximum memory
1 cpp/wc_vector 35.61 34.33 772244
2 python/wordcount_py2gabor.py 36.58 35.17 597112
3 go/bin/wordcount 37.92 36.45 855768
4 php7.0 php/wordcount.php 54.31 39.66 709476
5 python/wordcount_py2.py 60.34 58.18 1432904
6 python/wordcount_py3.py 93.57 90.93 1241448
7 mono csharp/WordCountList.exe 96.02 66.34 898000
8 perl/wordcount.pl 103.66 101.03 1237780
9 php5.6 php/wordcount.php 124.38 106.93 2119420
10 java -classpath java WordCount 145.07 130.04 1816224
11 julia julia/wordcount.jl 152.75 148.43 2568724
12 bash/wordcount.sh 273.16 288.23 13616
13 haskell/WordCount 284.39 275.77 4208052
14 cpp/wc_baseline 359.12 343.69 979528
15 nodejs javascript/wordcount.js 565.26 563.83 977348

 

Results on the full Hungarian Wikipedia

A few solutions did not run successfully on the full dataset:

  • all Java versions seem to hang for hours and were killed (by me) after a day,
  • the NodeJS and the Haskell versions run out of memory (16GB).

It’s also interesting to note that the PHP version is tested using both PHP5 and PHP7 with the latter performing significantly better. But without further ado, the final results are:

Rank Experiment CPU seconds User time Maximum memory
1 cpp/wc_vector 267.9 257.14 4126276
2 python/wordcount_py2gabor.py 333.34 321.56 3844908
3 go/bin/wordcount 349.26 332.43 6066928
4 php7.0 php/wordcount.php 464.55 377.82 4039392
5 python/wordcount_py2.py 545.38 529.71 8670208
6 mono csharp/WordCountList.exe 796.2 637.8 4780360
7 perl/wordcount.pl 881.23 861.38 6979772
8 python/wordcount_py3.py 909.86 888.65 7561112
9 php5.6 php/wordcount.php 1121.57 1001.84 12468856
10 julia julia/wordcount.jl 1798.55 1763.0 7284708
11 bash/wordcount.sh 2100.96 2128.94 13768

 

There were so many contributors that I can’t list them all here, please check in the repository.

Can I still contribute?

Yes, you are welcome to contribute. Especially solutions in other languages are welcome. However, I won’t update this article with your results. If a second wave of great solutions come, I might write a third article.

See the contribution guidelines in the previous article and in the repositories README.