Variable importance estimation

The ALGLIB numerical library includes a parallel, large-scale implementation of the decision forest algorithm, available in C++, C#, Python and Delphi/FreePascal. This article describes variable importance estimation with random forests. A general overview of decision forests is provided in another article.

Contents

    1 Variable importance estimation algorithms
           Gini variable importance (mean decrease in impurity)
           Permutation importance
           Comparison
    2 ALGLIB API

Variable importance estimation algorithms

Gini variable importance (mean decrease in impurity)

Gini importance, also known as mean decrease in impurity (MDI), rates a variable by the mean relative decrease in impurity over the splits that involve this variable. ALGLIB supports two versions of this method: one that uses the training sample (activated by the dfbuildersetimportancetrngini function) and one that uses the out-of-bag sample (activated by the dfbuildersetimportanceoobgini function).

The training sample version is the fastest and produces ratings that always sum to 1. This property can be confusing, because it holds even when the problem is completely unpredictable: in such cases all variables are rated as equally important, with importance equal to 1/NVars. In this sense the training sample version is somewhat prone to overfitting.

The out-of-bag version is slower but less prone to overfitting. It tends to produce fair estimates, and if your problem is pure random noise, it outputs importances that are nearly zero.
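To make the idea concrete, here is a minimal sketch of a mean-decrease-in-impurity calculation for a single, already-built classification tree. It is not ALGLIB's implementation: the split records, the helper names (gini, split_gain, mdi_importances) and the exact weighting and normalization are assumptions made for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_c^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(parent, left, right):
    """Impurity decrease of one split, weighted by the parent node size."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return n * (gini(parent) - children)

def mdi_importances(splits, nvars):
    """Sum split gains per variable, then scale them to sum to 1."""
    raw = [0.0] * nvars
    for var, parent, left, right in splits:
        raw[var] += split_gain(parent, left, right)
    total = sum(raw)
    if total == 0.0:
        return raw  # no split decreased impurity at all
    return [r / total for r in raw]

# Hypothetical splits of one tree: (variable index, parent, left, right labels)
splits = [
    (0, [0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1]),  # perfect split
    (1, [0, 0, 1, 1], [0, 1], [0, 1]),                          # useless split
]
importances = mdi_importances(splits, nvars=2)
print(importances)
```

In this sketch the labels passed to each split play the role of the training sample. An out-of-bag variant would evaluate the same splits on rows that the tree did not see during training.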

Permutation importance

The permutation variable importance estimator is also known as MDA (mean decrease in accuracy). This algorithm measures the mean increase in the out-of-bag sum of squared residuals after a random permutation of the J-th variable. The result is divided by the error computed with all variables perturbed, which produces an R-squared-like estimate in the [0,1] range. This estimator is activated by the dfbuildersetimportancepermutation function.

Such an estimate is slower to calculate than the Gini rating because it needs multiple inference runs for each variable being studied. ALGLIB uses a highly optimized parallel algorithm that analyzes the path through the decision tree and handles most perturbations in O(1) time. Nevertheless, requesting MDA importance may increase forest construction time by 10% to 200% (or more, if you have thousands of variables).
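The permutation rating can be sketched in a naive, brute-force form as follows. This is only an illustration of the ratio described above, not ALGLIB's optimized path-based algorithm. The stand-in model function, the helper names, the use of one shared held-out sample instead of per-tree out-of-bag rows, and the clipping to [0,1] are all assumptions.

```python
import random

def sse(model, rows, targets):
    """Sum of squared residuals of the model on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets))

def permute_columns(rows, cols, rng):
    """Copy rows and shuffle each listed column independently."""
    out = [list(r) for r in rows]
    for j in cols:
        column = [r[j] for r in out]
        rng.shuffle(column)
        for r, v in zip(out, column):
            r[j] = v
    return out

def permutation_importances(model, rows, targets, nvars, seed=0):
    rng = random.Random(seed)
    base = sse(model, rows, targets)
    # Reference error: all variables permuted at once.
    all_perm = sse(model, permute_columns(rows, range(nvars), rng), targets)
    ratings = []
    for j in range(nvars):
        increase = sse(model, permute_columns(rows, [j], rng), targets) - base
        ratio = increase / all_perm if all_perm > 0 else 0.0
        ratings.append(min(1.0, max(0.0, ratio)))  # clipping is an assumption
    return ratings

# Hypothetical stand-in for a trained forest: it depends on x0 only.
def model(row):
    return 3.0 * row[0]

data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
targets = [3.0 * r[0] for r in rows]
ratings = permutation_importances(model, rows, targets, nvars=2)
print(ratings)
```

Here the variable the model ignores gets a rating of exactly zero, because permuting it never changes a prediction. The variable the model relies on gets a rating close to 1.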

Comparison

Informally speaking, the permutation importance estimator (MDA) answers the following question: what part of the model's predictive power is ruined by permuting the k-th variable? The Gini importance estimator (MDI) tells us what part of the model's predictive power was achieved by using the k-th variable.

Thus, MDI (and OOB-MDI too) tends to divide a "unit amount of importance" between several important variables. In contrast, MDA rates each variable independently on a [0,1] scale. A critically important variable will have a rating close to 1.0, and several variables may have such a rating at once.
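The difference in scale can be shown with a toy calculation. The numbers below are made up for illustration and do not come from any real model, and the two helper functions are hypothetical simplifications of the two rating schemes.

```python
def unit_shares(raw_gains):
    """MDI-style view: scale raw gains so that they sum to 1."""
    total = sum(raw_gains)
    return [g / total for g in raw_gains]

def independent_ratings(error_increases, reference_error):
    """MDA-style view: rate each variable against a common reference error."""
    return [min(1.0, max(0.0, e / reference_error)) for e in error_increases]

# Made-up numbers for three variables; the first two are both essential.
mdi_like = unit_shares([5.0, 5.0, 0.0])
mda_like = independent_ratings([95.0, 90.0, 0.0], reference_error=100.0)
print(mdi_like)
print(mda_like)
```

Under these assumed inputs the two essential variables split the unit between them in the first scheme, while each of them keeps a rating near 1 in the second.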

ALGLIB API

You should activate one of the variable importance estimation methods before constructing the decision forest: dfbuildersetimportancetrngini for MDI, dfbuildersetimportanceoobgini for OOB-MDI, or dfbuildersetimportancepermutation for MDA.

After calling the dfbuilderbuildrandomforest construction function you get a decision forest and a dfreport object. The latter contains the rep.varimportances array (variable ratings) and the rep.topvars array (variables ordered from the most important to the least important).
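The relation between the two report fields can be sketched without calling the library. The snippet below does not use ALGLIB. It assumes that the ratings are available as a plain list of floats and rebuilds an ordering of the kind rep.topvars is described as holding. The function name order_by_importance and the sample ratings are hypothetical.

```python
def order_by_importance(varimportances):
    """Indices of variables sorted from most to least important."""
    return sorted(range(len(varimportances)),
                  key=lambda j: varimportances[j], reverse=True)

# Hypothetical ratings standing in for rep.varimportances
varimportances = [0.05, 0.70, 0.25]
topvars = order_by_importance(varimportances)

# Print a simple ranking table: rank, variable index, rating
for rank, j in enumerate(topvars, start=1):
    print(f"{rank}. variable {j}: {varimportances[j]:.2f}")
```

A ranking like this is a convenient starting point for feature selection, for example keeping only the first few entries of the ordering.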

This article is licensed for personal use only.
