Unlocking Random Forest In Machine Learning

November 30, 2021

In this tutorial, we’ll explain the random forest algorithm in machine learning.

Random forests are powerful, popular, and easy-to-use algorithms for predictive modeling. As the name suggests, the model is an ensemble of many decision trees, which usually performs better than any individual tree alone. The algorithm can be used for both supervised classification and regression problems.

Following this tutorial, you’ll learn:

What are ensembling and bagging?
What is a random forest in machine learning?
How to apply random forests using Python sklearn?
And more!

To make your decision tree more powerful, let’s explore the forest!

Before getting into the details of random forests, let’s look at these basic questions one by one.

What is Ensemble Learning?

Ensembling is a technique to build a predictive model by combining the strengths of multiple simpler base models. The goal of the method is to produce better predictions than any of the individual models alone.

The ensembling process involves developing a group of base models from the training data and then combining them in a certain way. We can either:

build the base models independently and then average their results to get the final prediction.
In this case, the ensembled predictor often has better predictions because of reduced variance.
or,
build the base models sequentially, with each new model trying to correct the errors of the ones before it.
In this case, the ensembled predictor usually produces better predictions because of reduced bias.

We won’t go into detail on all the possible methodologies. Random forests use the averaging approach mentioned above, or, more specifically, the Bootstrap AGGregatING (bagging) method. It’s often the first ensembling algorithm that machine learning beginners master.

Next, let’s dig more into bagging.

What is Bagging?

Bagging consists of repeatedly taking bootstrap samples (random sampling with replacement) of the training dataset and using them to fit machine learning models.

The aggregated prediction from bagging is the average (regression) or the majority vote (classification) of the predictions from each of the trained models.

Let’s see an example.

For a training dataset D with 6 data points [0, 1, 2, 3, 4, 5], we can draw three random samples with replacement of size 6:

[0, 1, 2, 4, 4, 5]
[1, 2, 3, 3, 5, 5]
[0, 1, 3, 5, 5, 5]

Then we can fit three models using these random samples.
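As a rough illustration of the sampling step, the snippet below draws three bootstrap samples from the same six data points using NumPy. The seed is an arbitrary choice, so the samples will differ from the ones listed above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
D = np.array([0, 1, 2, 3, 4, 5])

# Draw three bootstrap samples: each has the same size as D and is sampled with replacement.
# Sorting is only for display, to mirror the lists in the text.
bootstrap_samples = [
    np.sort(rng.choice(D, size=len(D), replace=True)) for _ in range(3)
]

for sample in bootstrap_samples:
    print(sample.tolist())
```

Because the sampling is with replacement, some points appear more than once in a sample while others are left out entirely.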

For a new observation, we use all three models to make predictions. If this is:

a regression problem and the three models give predictions of 5.2, 7.7, and 6.9.
The final prediction of the bagging method would be their average, i.e., (5.2 + 7.7 + 6.9)/3 = 6.6.
a classification problem with target 1 or 0 and two out of the three models predict 1.
We would conclude the prediction is 1 based on bagging.
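The two aggregation rules can be sketched in a few lines of NumPy. The regression numbers come from the example above; the order of the three class votes is made up for illustration.

```python
import numpy as np

# Regression: average the three model predictions.
regression_preds = np.array([5.2, 7.7, 6.9])
bagged_regression = regression_preds.mean()
print(round(bagged_regression, 2))

# Classification: majority vote among 0/1 labels (two of the three models predict 1).
class_preds = np.array([1, 1, 0])
bagged_class = np.bincount(class_preds).argmax()
print(bagged_class)
```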

Bagging works especially well when the base models have low bias and high variance. In this situation, the average of the predictions still has low bias, while the variance of the aggregated model’s errors is lower than that of each individual model.

This is exactly the case for the random forest.

Decision trees, when grown to sufficient depth, tend to have low bias but high variance (i.e., they overfit). The random forest, as a bagging model of trees, keeps the low bias while having less variance than each individual tree, so it predicts better.

Two Sources of Randomness

So far, we’ve seen the first main source of randomness in the forest, which comes from the bagging method (random sampling).

To further lower the variance of the random forest, a second source of randomness is introduced. Instead of considering the whole set of features at every split, the random forest algorithm considers only a random subset of the features each time it splits a node.

This randomness helps make the trees less correlated and more diverse, so that they can cancel out each other’s errors. So the final aggregated model can predict better with lower variance.
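To see why less correlated trees help, consider a simplified model that is not part of the random forest algorithm itself: assume every tree’s prediction has the same variance sigma2 and every pair of trees has the same correlation rho. Under that assumption, the variance of the average of n_trees predictions is rho * sigma2 + (1 - rho) * sigma2 / n_trees. The function name below is made up for this sketch.

```python
def variance_of_average(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictions, each with variance sigma2
    and with pairwise correlation rho between any two of them."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


# With 500 trees, compare uncorrelated, mildly correlated, and highly correlated trees.
for rho in (0.0, 0.3, 0.9):
    print(rho, variance_of_average(1.0, rho, 500))
```

Adding more trees only shrinks the second term. The first term, rho * sigma2, stays no matter how many trees are averaged, which is why lowering the correlation between trees matters.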

That’s a lot of explanation!

Now we are ready to piece this information together.

What is Random Forest in Machine Learning?

To summarize, we can build random forests based on the general procedures below.

Step #1: From the training dataset of N observations and M features, draw a random sample with replacement of size n (n <= N).

Step #2: Grow a decision tree using the sample by repeating the following at each node until the stopping conditions (if there are any) are met:

randomly select m (m <= M) features from the M features as candidates for splitting.
pick the best split based on these m features.

Step #3: Repeat the previous two steps and grow many different decision trees to form a forest.

Now we have a random forest, which is an ensemble of trees!

Step #4: To predict a new observation, we use:

the average of the predictions from all the trees for regression problems.
the majority vote of the classes predicted by the trees for classification problems.
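The four steps above can be sketched by hand with sklearn’s DecisionTreeClassifier as the base model. This is a minimal illustration under assumptions, not how sklearn’s RandomForestClassifier is implemented: the number of trees, the seeds, and the choice of m as the square root of M are arbitrary picks for the sketch.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25  # odd and small, so there are no tied votes and the sketch runs quickly
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample of the training rows (here n = N).
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: grow a tree that considers a random subset of features at each split.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    tree.fit(X_train[idx], y_train[idx])
    # Step 3: collect the trees into a forest.
    trees.append(tree)

# Step 4: majority vote across trees (labels are 0/1, so a mean above 0.5 means class 1).
votes = np.stack([tree.predict(X_test) for tree in trees])  # shape (n_trees, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (forest_pred == y_test).mean()
print(f"Hand-rolled forest accuracy: {accuracy:.3f}")
```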

With the theoretical understanding, it’s time to apply the random forest with Python.

Python Example with sklearn

Build Random Forest

In this last section, we’ll fit both a decision tree and a random forest using Python scikit-learn (sklearn), and compare their results.

First, we import the necessary libraries for model building, dataset, and model evaluation.

We’ll use the breast cancer dataset with a binary target (benign or malignant).

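from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import numpy as np

# plot_confusion_matrix was removed in scikit-learn 1.2.
# This alias to its replacement keeps the calls below working on current versions.
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator

random_seed = 88888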

Next, we split the dataset into training and test datasets.

breast_cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    breast_cancer.data,
    breast_cancer.target,
    test_size=0.2,
    stratify=breast_cancer.target,
    random_state=random_seed,
)

We can fit a decision tree using the training dataset and calculate its confusion matrix using the test dataset.

dt = DecisionTreeClassifier(random_state=random_seed)
dt.fit(X_train, y_train)
y_test_pred_dt = dt.predict_proba(X_test)
plot_confusion_matrix(dt, X_test, y_test)

[Figure: confusion matrix of the decision tree on the test dataset]

We can also fit a random forest and print out its confusion matrix in the same way. In this example, we set the forest to contain 500 trees, but you may tune this hyperparameter to find its optimal value.

rf = RandomForestClassifier(random_state=random_seed, n_estimators=500)
rf.fit(X_train, y_train)
y_test_pred_rf = rf.predict_proba(X_test)
plot_confusion_matrix(rf, X_test, y_test)

[Figure: confusion matrix of the random forest on the test dataset]

As you can see, the random forest predicts better than the decision tree in terms of both true positives and true negatives.
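The number of trees is one hyperparameter worth tuning. One possible way to compare a few values, sketched here under assumptions that are not part of the original example, is the out-of-bag (OOB) score: each tree is evaluated on the training rows left out of its bootstrap sample, so no separate validation set is needed. The candidate values and the seed below are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit one forest per candidate size and record its out-of-bag accuracy.
oob_scores = {}
for n_trees in (50, 100, 200):
    rf_oob = RandomForestClassifier(n_estimators=n_trees, oob_score=True, random_state=0)
    rf_oob.fit(X_train, y_train)
    oob_scores[n_trees] = rf_oob.oob_score_
    print(n_trees, round(rf_oob.oob_score_, 3))
```

In practice the score often levels off after enough trees, at which point adding more mainly costs training time.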

Further Reading: If you are not familiar with these evaluation metrics, read 8 popular Evaluation Metrics for Machine Learning Models.

Let’s also look at AUC and the ROC curve.

fpr_dt, tpr_dt, thresholds_dt = roc_curve(y_test, y_test_pred_dt[:, 1])
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_test, y_test_pred_rf[:, 1])

dt_auc = roc_auc_score(y_test, y_test_pred_dt[:, 1])
rf_auc = roc_auc_score(y_test, y_test_pred_rf[:, 1])

plt.figure(figsize=(10, 8))

# Dashed diagonal: the reference line for a classifier that guesses at random
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr_dt, tpr_dt, label='Decision Tree (AUC = {:.3f})'.format(dt_auc))
plt.plot(fpr_rf, tpr_rf, label='Random Forest (AUC = {:.3f})'.format(rf_auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

We can see the improvement of random forest compared to decision tree as well.

[Figure: ROC curves and AUC of the random forest vs. the decision tree]

With that said, we can easily visualize a single decision tree and interpret its results, but we can’t do so as easily for a random forest due to its increased complexity.
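As a small illustration of how readable a single tree can be, the sketch below prints the split rules of a shallow tree with sklearn’s export_text. The depth limit of 2 and the seed are assumptions made to keep the output short, and the tree is refit on the full dataset for brevity rather than reusing the model above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A deliberately shallow tree so the printed rules stay short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(data.data, data.target)

# Render the learned splits as indented if/else style text.
tree_rules = export_text(small_tree, feature_names=list(data.feature_names))
print(tree_rules)
```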

Feature Importance/Selection

Besides predicting, the random forest is also useful for ranking the importance of features.

sklearn computes impurity-based feature importances for a fitted random forest. A feature’s importance is the (normalized) total reduction of the impurity criterion brought by that feature. The higher the value, the more important the feature.

importances = rf.feature_importances_
indices = importances.argsort()
feature_names = breast_cancer.feature_names

# Plot the impurity-based feature importances of the forest
y_ticks = np.arange(0, len(feature_names))
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(y_ticks, importances[indices])
# Set the tick positions before assigning their labels
ax.set_yticks(y_ticks)
ax.set_yticklabels(feature_names[indices])
ax.set_title("Random Forest Feature Importances (MDI)")
fig.tight_layout()
plt.show()

In the chart below, the features are ranked by importance, from the highest at the top to the lowest at the bottom.

One drawback of this method is that high-cardinality categorical features (those with many unique categories) can produce misleading results, because random forests (and decision trees) are biased in favor of variables with more levels.

[Figure: random forest feature importances (MDI) bar chart]

Another way of calculating feature importance is sklearn’s permutation_importance, which avoids the bias toward high-cardinality categorical variables. But it requires adjustments when the features are highly correlated, which is the case here, so we are not covering it in this tutorial.

That’s it!

To summarize, you’ve learned what a random forest is and how to use it for machine learning modeling.

Try to apply it to your next data science project!