The ALGLIB numerical library includes a parallel, large-scale implementation of the decision forest algorithm, available in C++, C#, Python and Delphi/FreePascal. In this article we compare the performance of ALGLIB decision forests with other random forest implementations. A general overview of decision forests is provided in another article.
Four random forest libraries are compared: ALGLIB, Ranger (C++), scikit-learn (Python) and Accord.NET (C#).
The following synthetic problem was used for testing: 64 variables, 10240 samples, 2 classes. Each algorithm was configured to build a forest with 100 trees, using sqrt(N) variables for the splits. All libraries in the list support parallel forest construction, but only one CPU core was used for this test.
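As a point of reference, an equivalent configuration can be set up through ALGLIB's dfbuilder interface (C++ edition). The sketch below is illustrative rather than the actual benchmark harness: the dataset is a tiny stand-in for the 64-variable problem, and the report field name follows the ALGLIB reference manual.

    #include <cmath>
    #include <cstdio>
    #include "dataanalysis.h"   // ALGLIB decision forest API

    using namespace alglib;

    int main()
    {
        // Tiny stand-in dataset: 4 samples, 2 variables, 2 classes;
        // each row is [x0, x1, class_label]. The benchmark used
        // 10240 samples with 64 variables instead.
        real_2d_array xy = "[[0,0,0],[0,1,0],[1,0,1],[1,1,1]]";
        ae_int_t npoints = 4, nvars = 2, nclasses = 2;

        dfbuilder builder;
        decisionforest forest;
        dfreport rep;

        dfbuildercreate(builder);
        dfbuildersetdataset(builder, xy, npoints, nvars, nclasses);
        dfbuildersetrndvars(builder, (ae_int_t)std::sqrt((double)nvars)); // sqrt(N) variables per split
        dfbuilderbuildrandomforest(builder, 100, forest, rep);            // 100 trees

        // Out-of-bag classification error, as reported by the library
        printf("OOB error: %.4f\n", (double)rep.oobrelclserror);
        return 0;
    }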
In addition to running time, we also compare model accuracy (out-of-bag estimates, as reported by each algorithm) and model size (after serialization). The results are discussed below.
First, one may see that all random forest implementations achieve nearly the same accuracy; they are, after all, implementations of the same core algorithm. The only exception is Accord.NET, which does not report out-of-bag estimates, but we have no reason to suspect that it is better or worse than its competitors in this respect.
The running times, however, differ significantly. The ALGLIB library clearly outperforms its closest competitors (Ranger and sklearn) by roughly a factor of 2. This behavior is roughly scale-independent and is explained by a better-optimized implementation of the same ideas. Accord.NET, on the other hand, is a clear outlier: its performance on this problem is 1800 times worse than that of ALGLIB.
Finally, we move to the memory footprint of the random forests. Generally, all random forest implementations share the same upper bound on model size: O(NTrees·M), where NTrees is the forest size and M is the dataset size (sample count). The bound follows from the fact that each tree is grown until its leaves are (nearly) pure, so a single tree contains O(M) nodes. However, depending on the implementation details (whether it is a dynamic language like Python, a managed .NET environment or a native C/C++ application), different multipliers can hide behind the big-O notation.
By default, ALGLIB produces uncompressed decision trees whose memory usage is roughly the same as Ranger's (C++), with the Python (sklearn) and C# (Accord.NET) implementations being roughly 2.5x larger than both. The situation changes when binary model compression (the dfbinarycompression function) is activated: compressed decision forests are 5x smaller than those produced by Ranger, 12x smaller than those produced by sklearn and 14x smaller than those produced by Accord.NET.
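Compression is a post-processing step applied to an already constructed forest. Below is a minimal sketch, assuming the C++ edition and reusing a forest built as in the previous snippet; dfserialize is used here to measure the serialized size, and dfbinarycompression is the compression entry point mentioned above.

    #include <string>
    #include <cstdio>
    #include "dataanalysis.h"

    using namespace alglib;

    // Assumes 'forest' was built as in the previous snippet.
    void report_compression(decisionforest &forest)
    {
        std::string blob;

        dfserialize(forest, blob);   // serialized size before compression
        printf("uncompressed: %u bytes\n", (unsigned)blob.size());

        dfbinarycompression(forest); // in-place binary compression
        dfserialize(forest, blob);   // serialized size after compression
        printf("compressed:   %u bytes\n", (unsigned)blob.size());
    }

The compressed forest remains a regular decisionforest object, so it can still be serialized and used for inference in the usual way.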
ALGLIB is 2x faster than its closest competitors (Ranger and sklearn) and produces compact models even in its default configuration (no binary compression). With binary compression enabled, ALGLIB produces the most compact trees of all the implementations tested, which makes it a good choice for extremely large machine learning problems!