This article describes the dataset formats for classification and regression problems accepted by the decision forest, an ALGLIB implementation of the random forest algorithm. A general overview of ALGLIB decision forests (available in C++ and C#) is provided in another article.
The random forest builder accepts datasets in matrix format: each row corresponds to one sample and each column to one variable.
For a problem with M samples and N variables, the dataset matrix has M rows and N+1 columns. The first N columns store the variable values. The last column stores either the class index (an integer from 0 to C-1, for classification problems with C classes) or the target value (for regression problems).
Note #1
Unlike ALGLIB interpolation algorithms, decision forests do not support problems with multiple target variables.
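The matrix layout described above can be illustrated with a small sketch. It uses plain Python lists rather than any ALGLIB call, and the helper names `build_dataset` and `check_class_labels` are hypothetical, introduced here only for illustration. It shows how an M×N feature matrix and a single target column combine into one dataset matrix with N+1 columns.

```python
def build_dataset(features, targets):
    """Append one target column to an M x N feature matrix (result: M rows, N+1 columns)."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same number of rows")
    n_vars = len(features[0]) if features else 0
    rows = []
    for x, y in zip(features, targets):
        if len(x) != n_vars:
            raise ValueError("all rows must have the same number of variables")
        # The target (class index or regression value) always goes last.
        rows.append([float(v) for v in x] + [float(y)])
    return rows


def check_class_labels(dataset, n_classes):
    """For classification: the last column must hold integers in 0..n_classes-1."""
    for row in dataset:
        label = row[-1]
        if label != int(label) or not 0 <= label < n_classes:
            raise ValueError(f"invalid class index: {label}")


# Classification example: M = 4 samples, N = 2 variables, C = 3 classes.
features = [[1.0, 0.5], [0.2, 1.5], [3.1, 0.0], [2.2, 2.2]]
labels = [0, 2, 1, 0]
xy = build_dataset(features, labels)
check_class_labels(xy, n_classes=3)
```

For a regression problem the same `build_dataset` helper applies unchanged; only the label check is skipped, because the last column then holds real-valued targets.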
Nominal variables can be encoded in several ways: as an integer index, or using either 1-of-N or 1-of-N-1 encoding. Random forest classifiers are encoding-agnostic, but different encodings still result in different classifier performance. It is recommended to comply with the following conventions:
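As an illustration of the three encodings named above (not a statement of which one is recommended), the sketch below encodes one nominal value in each way. The function names are hypothetical. It is also an assumption that 1-of-N-1 means dropping one reference category, here the last one, which is then represented by all zeros.

```python
def encode_index(value, categories):
    """Integer index encoding: one column holding the category's position."""
    return [float(categories.index(value))]


def encode_one_of_n(value, categories):
    """1-of-N encoding: N columns, exactly one of them set to 1."""
    return [1.0 if c == value else 0.0 for c in categories]


def encode_one_of_n_minus_1(value, categories):
    """1-of-N-1 encoding (assumed): N-1 columns, last category is all zeros."""
    return [1.0 if c == value else 0.0 for c in categories[:-1]]


colors = ["red", "green", "blue"]
as_index = encode_index("green", colors)
as_one_of_n = encode_one_of_n("green", colors)
as_reference = encode_one_of_n_minus_1("blue", colors)
```

The encoded columns replace the original nominal variable in the dataset matrix, so the choice of encoding changes N, the number of variables.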
Decision trees and random forests normally do not support missing values. However, a two-step workaround is possible. First, for each variable add a flag variable that indicates whether its value is missing (so the number of variables doubles). Second, replace each missing value with the mean, median, or mode of that variable.
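The two-step workaround can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are marked with `None`, the mean is used as the replacement, and the function name `impute_with_flags` is hypothetical. The input holds feature columns only; the target column would be appended afterwards.

```python
def impute_with_flags(rows):
    """Replace missing values (None) by the column mean and append one flag per variable."""
    n_vars = len(rows[0])
    means = []
    for j in range(n_vars):
        present = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(present) / len(present) if present else 0.0)
    result = []
    for r in rows:
        values = [means[j] if r[j] is None else float(r[j]) for j in range(n_vars)]
        flags = [1.0 if r[j] is None else 0.0 for j in range(n_vars)]
        # Output layout: N imputed values followed by N missing-value flags.
        result.append(values + flags)
    return result


raw = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = impute_with_flags(raw)
```

Using the median or mode instead of the mean only changes how `means` is computed; the flag columns stay the same.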
This article is licensed for personal use only.