The ALGLIB numerical library includes a parallel, large-scale implementation of the decision forest algorithm, available in C++, C#, Python and Delphi/FreePascal. In this article we compare the performance of ALGLIB decision forests with other random forest implementations. A general overview of decision forests is provided in another article.
Four random forest libraries are compared: ALGLIB, Ranger (C++), scikit-learn (Python) and Accord.NET (C#).
The following synthetic problem was used for testing: 64 variables, 10240 samples, 2 classes. Each algorithm was configured to build a forest with 100 trees, using sqrt(N) variables for the splits. All libraries in the list support parallel forest construction, but only one CPU core was used for this test.
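As a point of reference, an equivalent configuration can be set up through ALGLIB's dfbuilder interface (C++ edition). The sketch below is illustrative rather than the actual benchmark harness: the dataset is a tiny stand-in for the 64-variable problem, and the report field name follows the ALGLIB reference manual.

    #include <cmath>
    #include <cstdio>
    #include "dataanalysis.h"   // ALGLIB decision forest API

    using namespace alglib;

    int main()
    {
        // Tiny stand-in dataset: 4 samples, 2 variables, 2 classes;
        // each row is [x0, x1, class_label]. The benchmark used
        // 10240 samples with 64 variables instead.
        real_2d_array xy = "[[0,0,0],[0,1,0],[1,0,1],[1,1,1]]";
        ae_int_t npoints = 4, nvars = 2, nclasses = 2;

        dfbuilder builder;
        decisionforest forest;
        dfreport rep;

        dfbuildercreate(builder);
        dfbuildersetdataset(builder, xy, npoints, nvars, nclasses);
        dfbuildersetrndvars(builder, (ae_int_t)std::sqrt((double)nvars)); // sqrt(N) variables per split
        dfbuilderbuildrandomforest(builder, 100, forest, rep);            // 100 trees

        // Out-of-bag classification error, as reported by the library
        printf("OOB error: %.4f\n", (double)rep.oobrelclserror);
        return 0;
    }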
In addition to running time, we also compare model accuracy (out-of-bag estimates, as reported by each algorithm) and model size (after serialization). The results are discussed below.
First, one may see that all random forest implementations achieve nearly the same accuracy; they are, after all, implementations of the same core algorithm. The only exception is Accord.NET, which does not report out-of-bag estimates, but we have no reason to suspect that it is better or worse than its competitors in this respect.
The running times, however, differ significantly. The ALGLIB library clearly outperforms its closest competitors (Ranger and sklearn) by roughly a factor of 2. This behavior is roughly scale-independent and is explained by a better-optimized implementation of the same ideas. Accord.NET, on the other hand, is a clear outlier: its performance on this problem is 1800 times worse than that of ALGLIB.
Finally, we move to the memory footprint of the random forests. Generally, all random forest implementations share the same upper bound on model size: O(NTrees·M), where NTrees is the forest size and M is the dataset size (sample count). The bound follows from the fact that each tree is grown until its leaves are (nearly) pure, so a single tree contains O(M) nodes. However, depending on the implementation details (whether it is a dynamic language like Python, a managed .NET environment or a native C/C++ application), different multipliers can hide behind the big-O notation.
By default, ALGLIB produces uncompressed decision trees whose memory usage is roughly the same as Ranger's (C++), with the Python (sklearn) and C# (Accord.NET) implementations being roughly 2.5x larger than both. The situation changes when binary model compression (the dfbinarycompression function) is activated: compressed decision forests are 5x smaller than those produced by Ranger, 12x smaller than those produced by sklearn and 14x smaller than those produced by Accord.NET.
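Compression is a post-processing step applied to an already constructed forest. Below is a minimal sketch, assuming the C++ edition and reusing a forest built as in the previous snippet; dfserialize is used here to measure the serialized size, and dfbinarycompression is the compression entry point mentioned above.

    #include <string>
    #include <cstdio>
    #include "dataanalysis.h"

    using namespace alglib;

    // Assumes 'forest' was built as in the previous snippet.
    void report_compression(decisionforest &forest)
    {
        std::string blob;

        dfserialize(forest, blob);   // serialized size before compression
        printf("uncompressed: %u bytes\n", (unsigned)blob.size());

        dfbinarycompression(forest); // in-place binary compression
        dfserialize(forest, blob);   // serialized size after compression
        printf("compressed:   %u bytes\n", (unsigned)blob.size());
    }

The compressed forest remains a regular decisionforest object, so it can still be serialized and used for inference in the usual way.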
ALGLIB is 2x faster than its closest competitors (Ranger and sklearn) and produces compact models even in its default configuration (no binary compression). With binary compression enabled, ALGLIB produces the most compact trees of all the implementations tested, which makes it a good choice for extremely large machine learning problems!