Classification and regression dataset formats

This article describes the dataset formats for classification and regression problems accepted by the decision forest, ALGLIB's implementation of the random forest algorithm. A general overview of ALGLIB decision forests (available in C++ and C#) is provided in another article.

Contents

    1 Dataset Format
    2 Nominal Variable Encoding
    3 Missing Values Encoding
    4 Downloads section

Dataset Format

The random forest builder accepts datasets in matrix format: each matrix row corresponds to one sample element, and each matrix column corresponds to one variable.

The dataset matrix for a problem with M elements and N variables has size M×(N+1). The first N columns hold the variables, and the last column holds either the class index (an integer from 0 to C-1, where C is the number of classes, for classification problems) or the target value (for regression problems).
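The layout described above can be illustrated with a short sketch in plain Python. The helper name build_dataset and its n_classes parameter are hypothetical, introduced only for this example; they are not part of the ALGLIB API. The sketch assembles the M×(N+1) matrix and checks that class indices fall in the range 0 to C-1. Passing the matrix to the forest builder is not shown, because that step depends on the ALGLIB language binding in use.

```python
def build_dataset(features, targets, n_classes=None):
    """Return an M x (N+1) matrix as a list of rows: N variables, then the target.

    n_classes is the number of classes C for a classification problem,
    or None for a regression problem (hypothetical convention of this sketch).
    """
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same number of rows")
    n_vars = len(features[0]) if features else 0
    matrix = []
    for row, target in zip(features, targets):
        if len(row) != n_vars:
            raise ValueError("all rows must have the same number of variables")
        if n_classes is not None:
            # Classification: the last column must be a class index in 0..C-1.
            if target != int(target) or not 0 <= target < n_classes:
                raise ValueError(f"class index {target!r} is outside 0..{n_classes - 1}")
        # Variables first, target (class index or real value) in the last column.
        matrix.append([float(v) for v in row] + [float(target)])
    return matrix


# Classification: M=3 elements, N=2 variables, C=2 classes.
xy_cls = build_dataset([[1.0, 5.0], [2.0, 3.5], [0.5, 4.0]], [0, 1, 0], n_classes=2)

# Regression: same variables, real-valued target in the last column.
xy_reg = build_dataset([[1.0, 5.0], [2.0, 3.5], [0.5, 4.0]], [0.7, 1.9, 0.4])
```

Both matrices have three rows and three columns (N+1 with N=2); only the meaning of the last column differs.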

Note #1
Unlike ALGLIB interpolation algorithms, decision forests do not support problems with multiple target variables.

Nominal Variable Encoding

Nominal variables can be encoded in several ways: as an integer index, or using either 1-of-N or 1-of-N-1 encoding. Random forest classifiers accept any of these encodings, but different encodings still result in different classifier performance. It is recommended to comply with the following conventions:
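To make the three encodings named above concrete, here is a minimal sketch in plain Python. It only illustrates what each encoding produces for one nominal value and does not state which encoding to prefer. The function names are hypothetical, and treating the last category as the all-zeros reference in the 1-of-N-1 case is an assumption of this sketch.

```python
def _check(value, categories):
    # Reject values that are not among the known categories.
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")


def encode_index(value, categories):
    """Integer index: a single column holding the category's position."""
    _check(value, categories)
    return [float(categories.index(value))]


def encode_one_of_n(value, categories):
    """1-of-N: N columns, exactly one of which is 1."""
    _check(value, categories)
    return [1.0 if value == c else 0.0 for c in categories]


def encode_one_of_n_minus_1(value, categories):
    """1-of-N-1: N-1 columns; the last category is encoded as all zeros."""
    _check(value, categories)
    return [1.0 if value == c else 0.0 for c in categories[:-1]]


colors = ["red", "green", "blue"]
as_index = encode_index("green", colors)
as_one_of_n = encode_one_of_n("green", colors)
as_reduced_green = encode_one_of_n_minus_1("green", colors)
as_reduced_blue = encode_one_of_n_minus_1("blue", colors)
```

With three categories, the integer index occupies one column of the dataset matrix, 1-of-N occupies three columns, and 1-of-N-1 occupies two.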

Missing Values Encoding

Decision trees and random forests normally do not support missing values. However, a two-step workaround is possible: first, add a special flag variable indicating that the value is missing (so the variable count doubles if every variable gets a flag); second, replace the missing value itself with the mean, median, or mode of that variable.
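The workaround above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not ALGLIB code: missing values are represented by None, every variable gets a flag column, the mean is used as the replacement value, and the function name impute_with_flags is hypothetical. The input holds the variable columns only; the class index or target value would be appended afterwards.

```python
from statistics import mean


def impute_with_flags(rows):
    """Replace each variable by a (value, missing_flag) pair of columns.

    Missing entries (None) are replaced by the column mean of the observed
    values, and the flag column is 1.0 where the original value was missing.
    """
    n_vars = len(rows[0])
    fill = []
    for j in range(n_vars):
        observed = [r[j] for r in rows if r[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        fill.append(mean(observed) if observed else 0.0)

    result = []
    for r in rows:
        new_row = []
        for j, v in enumerate(r):
            missing = v is None
            new_row.append(fill[j] if missing else float(v))
            new_row.append(1.0 if missing else 0.0)
        result.append(new_row)
    return result


rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
imputed = impute_with_flags(rows)
```

Each input row with two variables becomes a row with four columns: value, flag, value, flag. The same fill values computed on the training set should be reused when preparing samples for prediction.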

This article is licensed for personal use only.

Download ALGLIB for C++ / C# / Java / Python / ...

ALGLIB Project offers you two editions of ALGLIB:

ALGLIB Free Edition:
+delivered for free
+offers full set of numerical functionality
+extensive algorithmic optimizations
-no multithreading
-non-commercial license

ALGLIB Commercial Edition:
+flexible pricing
+offers full set of numerical functionality
+extensive algorithmic optimizations
+high performance (SMP, SIMD)
+commercial license with support plan

Links to download sections for Free and Commercial editions can be found below:

ALGLIB 4.03.0 for C++

C++ library.
Delivered with sources.
Monolithic design.
Extreme portability.
Editions:   FREE   COMMERCIAL

ALGLIB 4.03.0 for C#

C# library with native kernels.
Delivered with sources.
VB.NET and IronPython wrappers.
Extreme portability.
Editions:   FREE   COMMERCIAL

ALGLIB 4.03.0 for Java

Java wrapper around HPC core.
Delivered with sources.
Seamless integration with Java.
Editions:   FREE   COMMERCIAL

ALGLIB 4.03.0 for Delphi

Delphi wrapper around C core.
Delivered as precompiled binary.
Compatible with FreePascal.
Editions:   FREE   COMMERCIAL

ALGLIB 4.03.0 for CPython

CPython wrapper around C core.
Delivered as precompiled binary.
Editions:   FREE   COMMERCIAL