Classification and regression dataset formats

This article describes the dataset formats for classification and regression problems used by the decision forest, the ALGLIB implementation of the random forest algorithm. A general overview of ALGLIB decision forests (available in C++ and C#) is provided in another article.

    1 Dataset Format
    2 Nominal Variable Encoding
    3 Missing Values Encoding

Dataset Format

The random forest builder accepts datasets in matrix format, with rows corresponding to samples and columns corresponding to variables.

The dataset matrix for a problem with M samples and N variables has size M×(N+1). The last column holds either the class index (an integer from 0 to C-1, where C is the number of classes, for classification problems) or the target value (for regression problems).
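The layout described above can be sketched in a few lines of plain Python. This is only an illustration of the matrix shape, not ALGLIB code: the helper name `build_dataset` and the sample values are assumptions made for this example, and passing the resulting matrix to the forest builder is left out.

```python
def build_dataset(features, targets):
    """Append each target to its feature row, giving an M x (N+1) matrix.

    `features` is a list of M rows with N values each; `targets` holds
    M class indices (classification) or M real values (regression).
    Hypothetical helper, not part of ALGLIB.
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same number of rows")
    n = len(features[0]) if features else 0
    rows = []
    for x, y in zip(features, targets):
        if len(x) != n:
            raise ValueError("all samples must have the same number of variables")
        # The target goes into the last column of the row.
        rows.append([float(v) for v in x] + [float(y)])
    return rows


# Classification: M = 4 samples, N = 2 variables, C = 3 classes (indices 0..2).
features = [[5.1, 3.5], [6.2, 2.9], [4.7, 3.2], [6.9, 3.1]]
class_idx = [0, 1, 0, 2]
xy = build_dataset(features, class_idx)
```

For a regression problem the same helper applies unchanged, with real-valued targets in place of class indices.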

Note #1
Unlike the ALGLIB interpolation algorithms, decision forests do not support problems with multiple target variables.

Nominal Variable Encoding

Nominal variables can be encoded in several ways: as an integer index, or using either 1-of-N or 1-of-N-1 encoding. Random forest classifiers are encoding-agnostic, but different encodings can still result in different classifier performance. We recommend the following conventions:

Nominal variables with two possible values are encoded as either 0 or 1 (that is, using the 1-of-N-1 encoding).
Nominal variables with three or more possible values are encoded using 1-of-N encoding (e.g., "red", "yellow" and "green" can be encoded as "1 0 0", "0 1 0", "0 0 1").
A nominal variable can also be encoded as an integer (0, 1, 2, ...). This encoding is recommended for ordinal variables (those whose values can be arranged in increasing or decreasing order) and for variables with too many possible values to use 1-of-N encoding.
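The three conventions above can be sketched as follows. The function names (`binary`, `one_of_n`, `ordinal`) and the example categories other than the colors are hypothetical and chosen only for illustration; they are not part of ALGLIB.

```python
def binary(value, categories):
    """Two-valued nominal variable as a single 0/1 column (1-of-N-1)."""
    if len(categories) != 2:
        raise ValueError("binary encoding needs exactly two categories")
    return [float(categories.index(value))]


def one_of_n(value, categories):
    """1-of-N encoding: one column per category, 1.0 at the matching position."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def ordinal(value, ordered_levels):
    """Integer encoding: position of the value in an ordered list of levels."""
    return [float(ordered_levels.index(value))]


colors = ["red", "yellow", "green"]
row = (
    binary("yes", ["no", "yes"])                   # -> [1.0]
    + one_of_n("yellow", colors)                   # -> [0.0, 1.0, 0.0]
    + ordinal("medium", ["low", "medium", "high"])  # -> [1.0]
)
```

Each helper returns a list of columns, so the encoded pieces can be concatenated into one feature row before the target column is appended.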

Missing Values Encoding

Decision trees and random forests normally do not support missing values. However, a two-step workaround is possible: first, add a special flag variable that indicates whether the value is missing (so the number of variables doubles); second, replace the missing value itself with the mean, mode, or median of that variable.
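A minimal sketch of this workaround is shown below, assuming that missing entries are represented by `None` and that the mean is used as the replacement value. The helper `impute_with_flags` is hypothetical and operates on the feature columns only; the target column should be appended afterwards.

```python
from statistics import mean


def impute_with_flags(rows):
    """Replace missing values by column means and append one flag per variable.

    Input: M rows of N values, with None marking a missing entry.
    Output: M rows of 2*N values (N imputed values followed by N flags).
    Hypothetical helper, not part of ALGLIB.
    """
    n = len(rows[0])
    # Column means are computed over the observed (non-missing) values only.
    fill = []
    for j in range(n):
        observed = [r[j] for r in rows if r[j] is not None]
        fill.append(mean(observed) if observed else 0.0)
    out = []
    for r in rows:
        values = [fill[j] if r[j] is None else float(r[j]) for j in range(n)]
        flags = [1.0 if r[j] is None else 0.0 for j in range(n)]
        out.append(values + flags)
    return out


raw = [[1.0, None], [3.0, 4.0], [None, 8.0]]
filled = impute_with_flags(raw)
```

For nominal variables the mode is the natural replacement, and the median may be preferable to the mean when a variable has outliers; only the line that computes `fill` would change.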

This article is licensed for personal use only.
