Feature Importance

One of the most basic questions we can ask about a model is feature importance: which features have the biggest impact on predictions?

There are multiple ways to measure feature importance. Some approaches answer subtly different versions of the question above. Other approaches have documented shortcomings. Compared to most other approaches, permutation importance is:

  • fast to calculate
  • widely used and understood
  • consistent with properties we would want a feature importance measure to have

How It Works

Permutation importance is calculated after a model has been fitted. It answers the following question:

  • If we randomly shuffle a single column of the validation data, leaving the target and all other columns in place, how would that affect the accuracy of predictions on that now-shuffled data?

With this insight, the process is as follows:

  1. Get a trained model.
  2. Shuffle the values in a single column and make predictions using the resulting dataset. Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.
  3. Return the data to the original order, then repeat step 2 with the next column in the dataset, until you have calculated the importance of each column (a minimal sketch of this loop follows the list).
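
As a concrete illustration of these steps, here is a minimal sketch in Python. It assumes a fitted scikit-learn-style regressor named model, a pandas DataFrame of validation features X_valid, and the matching target y_valid; those names, and the choice of mean absolute error as the loss, are placeholder assumptions rather than part of any particular library.

    import numpy as np
    from sklearn.metrics import mean_absolute_error

    def permutation_importance_sketch(model, X_valid, y_valid, n_repeats=5, random_state=0):
        """Estimate each column's importance by how much the validation
        error grows when only that column is shuffled."""
        rng = np.random.default_rng(random_state)
        baseline = mean_absolute_error(y_valid, model.predict(X_valid))
        importances = {}
        for col in X_valid.columns:
            deteriorations = []
            for _ in range(n_repeats):
                X_shuffled = X_valid.copy()  # target and all other columns stay in place
                X_shuffled[col] = rng.permutation(X_shuffled[col].values)
                shuffled_error = mean_absolute_error(y_valid, model.predict(X_shuffled))
                deteriorations.append(shuffled_error - baseline)  # performance deterioration
            # Working on a fresh copy each repeat restores the original order (step 3).
            importances[col] = float(np.mean(deteriorations))
        return importances

Columns with larger values caused a bigger drop in accuracy when shuffled, and so are estimated to be more important.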

Interpreting Permutation Importance

The larger the decrease in accuracy, the more important the feature. Like most things in data science, there is some randomness to the exact performance change from shuffling a column. We can measure the amount of randomness in our permutation importance calculation by repeating the process with multiple shuffles and calculating the mean.

In cases where permutation importance is negative, the predictions on the shuffled data happened to be more accurate than on the real data. This happens when the feature didn't matter (it should have had an importance close to 0), but random chance caused the predictions on the shuffled data to be more accurate. This is more common with small datasets, like the one in this example, because there is more room for luck/chance.
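
For a library-backed version of this calculation, scikit-learn provides permutation_importance, which repeats the shuffles and reports both the mean and the standard deviation of the importance for each feature. The model and validation data below are hypothetical stand-ins; only the permutation_importance call itself is scikit-learn's API.

    from sklearn.inspection import permutation_importance

    # model, X_valid, y_valid: hypothetical fitted model and validation split.
    result = permutation_importance(model, X_valid, y_valid,
                                    n_repeats=10, random_state=42)

    for name, mean, std in sorted(zip(X_valid.columns,
                                      result.importances_mean,
                                      result.importances_std),
                                  key=lambda item: -item[1]):
        # Small negative means can appear for unimportant features, especially on small datasets.
        print(f"{name}: {mean:.4f} +/- {std:.4f}")

Reading the output from top to bottom ranks the features from most to least important, and the +/- column shows how much the estimate varies across repeated shuffles.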
