Random forest, also known as decision forest, is a popular ensemble method for classification, regression and several other tasks.
ALGLIB includes one of the best open source implementations of the decision forest algorithm, available in C++, C#, Python and Delphi/FreePascal. Additional features include variable importance estimation, binary model compression and ExtraTrees.
1 Random/decision forest algorithm overview
2 Additional features
Variable importance
Binary compression
Extremely randomized trees (ExtraTrees)
3 Decision forest API
4 Benchmarks
5 Other articles in this section
6 Downloads section
Strictly speaking, "random forest" is an original algorithm developed by Leo Breiman and Adele Cutler (with its name being trademarked). Independent implementations sometimes have slight deviations from the original idea, but still use this name (sometimes changing it to a neutral 'decision forest', like we did in ALGLIB).
Anyway, all random forest-like algorithms share the same core ideas: each tree is trained on a bootstrap sample of the dataset (bagging), additional randomness is injected into tree construction (for example, by considering only a random subset of variables at each split), and the predictions of the individual trees are aggregated by voting (classification) or averaging (regression).
Random forests can be used out of the box without any tuning, provide fast non-iterative training, scale well to big data problems and come with an internal unbiased estimate of the generalization error (the so-called out-of-bag estimate).
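The out-of-bag estimate comes for free from bagging: each tree is trained on a bootstrap sample, and the points that were left out of that sample act as an independent test set for that tree. The following is a minimal stdlib sketch of this bookkeeping (with a deliberately trivial stand-in for a tree); it illustrates the idea only and is not ALGLIB code:

```python
import random

def bootstrap_oob_demo(n_points=200, n_trees=50, seed=1):
    """Illustrate out-of-bag (OOB) bookkeeping: each tree sees a bootstrap
    sample; points absent from that sample form the tree's OOB set, and the
    aggregated OOB predictions yield an unbiased error estimate."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_points)]
    # votes[i] collects predictions for point i from trees that did NOT see it
    votes = [[] for _ in range(n_points)]
    oob_fractions = []
    for _ in range(n_trees):
        sample = [rng.randrange(n_points) for _ in range(n_points)]
        in_bag = set(sample)
        oob = [i for i in range(n_points) if i not in in_bag]
        oob_fractions.append(len(oob) / n_points)
        # toy "tree": predict the majority class of its bootstrap sample
        majority = round(sum(labels[i] for i in sample) / len(sample))
        for i in oob:
            votes[i].append(majority)
    # OOB error: compare each point's aggregated OOB vote with its true label
    scored = [(i, round(sum(v) / len(v))) for i, v in enumerate(votes) if v]
    oob_error = sum(pred != labels[i] for i, pred in scored) / len(scored)
    return sum(oob_fractions) / n_trees, oob_error
```

Since each bootstrap draw misses a point with probability (1-1/N)^N, roughly 1/e (about 37%) of the dataset is out-of-bag for any given tree, so every point accumulates OOB votes from about a third of the forest.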
However, we may note one major disadvantage of random forests - large memory requirements. Memory usage grows linearly with the ensemble size and linearly/sublinearly with the dataset size. This problem is partially addressed by binary model compression, recently introduced in ALGLIB.
Due to their unbiased nature, random forests are often used for variable importance estimation; sometimes they are trained for that purpose alone, without ever being used for inference.
ALGLIB provides several variable importance algorithms: Gini importance (also known as mean decrease in impurity; both training-set and out-of-bag versions) and the permutation estimator (the gold standard among importance estimators: unbiased, but sometimes very slow). These features are discussed in more detail in the article on variable importance estimation.
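The permutation estimator works by shuffling one feature column at a time and measuring how much the model's accuracy drops: shuffling an important feature destroys its relationship with the target, while shuffling an irrelevant one changes nothing. The sketch below is a toy stdlib illustration of this idea (the `predict`/`X`/`y` setup is ours, not the ALGLIB API):

```python
import random

def permutation_importance(predict, X, y, rng):
    """Sketch of the permutation estimator: for each feature, shuffle its
    column, re-score the model, and report the accuracy drop."""
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    base = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        rng.shuffle(col)
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        importances.append(base - accuracy(permuted))
    return importances

rng = random.Random(7)
# feature 0 fully determines the label; feature 1 is pure noise
X = [[rng.randint(0, 1), rng.randint(0, 1)] for _ in range(400)]
y = [row[0] for row in X]
imp = permutation_importance(lambda r: r[0], X, y, rng)
```

On this toy dataset the first importance is large (shuffling feature 0 roughly halves the accuracy) and the second is exactly zero, since the model never looks at feature 1. The estimator's cost is one full re-scoring pass per feature, which is why it can be very slow on wide datasets.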
We have already mentioned that random forests suffer from high memory consumption: decision trees are big, and a forest contains many of them. ALGLIB decision forests support binary model compression, which often results in a roughly 4x-6x reduction of the memory footprint. Binary compression must be explicitly activated with the dfbinarycompression function.
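To see where such savings come from, note that most fields in a tree node (variable indices, child offsets, class labels) are small integers, yet a naive in-memory representation stores every field in a full-width slot. The sketch below contrasts fixed 8-byte slots with a LEB128-style variable-length encoding; it is a conceptual illustration of why binary packing shrinks tree storage, not ALGLIB's actual compressed format:

```python
import struct

def encode_varint(n):
    """LEB128-style variable-length encoding: small nonnegative integers
    occupy 1 byte instead of a fixed-width 8-byte slot."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

# Toy "forest": nodes as (feature_index, child_offset) pairs with small values.
nodes = [(i % 10, (i * 3) % 100) for i in range(1000)]
naive = b"".join(struct.pack("dd", f, c) for f, c in nodes)  # 8 bytes per field
packed = b"".join(encode_varint(f) + encode_varint(c) for f, c in nodes)
ratio = len(naive) / len(packed)
```

Here every field fits in a single byte, so the packed form is 8x smaller than the naive one; real forests also contain floating-point thresholds, which is consistent with the more modest 4x-6x reduction quoted above.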
ExtraTrees are similar to random forests, but they inject much more randomness into the forest. Instead of choosing the best split point during decision tree construction, a random split point is selected. Up to sqrt(N) such random splits (with different variables) are tried, and the best random split is selected.
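The split-selection step just described can be sketched as follows: draw up to sqrt(N) random (variable, threshold) candidates and keep the one with the lowest Gini impurity, instead of exhaustively scanning all thresholds. This is a conceptual stdlib illustration, not the ALGLIB implementation:

```python
import math
import random

def gini(groups):
    """Weighted Gini impurity of a split; each group is a list of 0/1 labels."""
    total = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        p1 = sum(g) / len(g)
        score += (len(g) / total) * (1.0 - p1 * p1 - (1.0 - p1) ** 2)
    return score

def extra_trees_split(X, y, rng):
    """ExtraTrees-style split selection: try up to sqrt(N) fully random
    (variable, threshold) candidates, keep the best one by Gini impurity."""
    n_vars = len(X[0])
    n_tries = max(1, math.isqrt(len(X)))
    best = None
    for _ in range(n_tries):
        j = rng.randrange(n_vars)
        lo, hi = min(r[j] for r in X), max(r[j] for r in X)
        t = rng.uniform(lo, hi)
        left = [y[i] for i, r in enumerate(X) if r[j] < t]
        right = [y[i] for i, r in enumerate(X) if r[j] >= t]
        cand = (gini([left, right]), j, t)
        if best is None or cand[0] < best[0]:
            best = cand
    return best  # (impurity, variable index, threshold)
```

Because each node evaluates only a handful of candidate splits, ExtraTrees train faster than classical random forests, and the extra randomness further decorrelates the trees at the cost of a slightly higher bias per tree.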
The ALGLIB decision forest builder can be configured to produce extremely randomized trees; exactly the same sequence of calls works in C++, C# and the other supported languages.
Decision forest functionality is provided by the dforest subpackage, which includes functions for decision forest construction (an extensive, configurable API is provided), inference, serialization and variable importance estimation.
The ALGLIB API separates forest construction from inference: the former is performed by the decisionforestbuilder class, the latter by the decisionforest class. One starts by creating a builder object, provides a dataset and (optionally) tweaks algorithm parameters. After that, a decision forest instance is created. The ALGLIB Reference Manual includes two examples: randomforest_cls and randomforest_reg.
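The shape of this workflow - a mutable builder that is configured first, then produces an immutable inference-only model - can be sketched in a few lines. The class and method names below are hypothetical stand-ins chosen for illustration, not the ALGLIB API:

```python
class ForestBuilder:
    """Toy builder: collects the dataset and parameters, then builds a model.
    Mirrors the construction/inference split described above (hypothetical
    names, not ALGLIB's)."""
    def __init__(self):
        self.xy = None
        self.n_trees = 100

    def set_dataset(self, xy):
        self.xy = xy
        return self

    def set_ntrees(self, n):
        self.n_trees = n
        return self

    def build(self):
        if self.xy is None:
            raise ValueError("dataset must be set before building")
        # real training would go here; this toy model predicts the majority label
        majority = round(sum(label for *_, label in self.xy) / len(self.xy))
        return ForestModel(majority)

class ForestModel:
    """Immutable inference-only object, analogous to a built decision forest."""
    def __init__(self, majority):
        self._majority = majority

    def predict(self, x):
        return self._majority

model = ForestBuilder().set_dataset([[0.1, 0], [0.2, 0], [0.9, 1]]).set_ntrees(50).build()
```

Keeping the built model separate from the builder means the model can be serialized, shared between threads and moved between language versions without carrying any training-time state along.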
ALGLIB has versions in several programming languages, including C++ and C#. All versions provide exactly the same decision forest API, but with different performance. The C++ version is, as expected, the fastest one. The C# version is somewhat slower due to array bounds checks. An interesting feature is the ability to move serialized decision forest models between the C++ and C# versions.
Commercial users may get a roughly linear speed-up due to parallelism capabilities absent from Free ALGLIB. Additionally, commercial C# users may access the native C computational core and get the same performance as C++ users. Aside from that, the Free and Commercial editions are equal.
We have prepared a benchmark that compares ALGLIB decision forests with several well-known random forest implementations, including Ranger (C/C++), sklearn and Accord.NET.
Below is a complete list of all articles in this section that provide an in-depth description of the topics briefly discussed above:
ALGLIB Project offers you two editions of ALGLIB:
ALGLIB Free Edition:
+delivered for free
+offers full set of numerical functionality
+extensive algorithmic optimizations
-no multithreading
-non-commercial license
ALGLIB Commercial Edition:
+flexible pricing
+offers full set of numerical functionality
+extensive algorithmic optimizations
+high performance (SMP, SIMD)
+commercial license with support plan
Links to download sections for Free and Commercial editions can be found below: