Master's Thesis · Machine Learning

Study of Random Forest and Its Variants

Master's thesis — ensemble learning, hyperparameter tuning, and imbalanced classification

Authors

Rishabh Bhartiya

Date

September 2022

Institution

Univ. Milano

Supervisor

Prof. Gabriele Gianini

Abstract

This thesis investigates Random Forest and its principal variants — ExtraTreesClassifier, BalancedRandomForest, and EasyEnsemble — across three applied medical and financial datasets. The study examines how hyperparameter tuning (n_estimators, max_depth, min_samples_split, class_weight) affects generalization on imbalanced datasets where minority-class precision is clinically or financially critical. Experiments were conducted across credit card fraud detection, breast cancer diagnosis, and heart disease prediction, comparing baseline RF against tuned variants using precision, recall, F1, AUC-ROC, and confusion matrix analysis. The results demonstrate that hyperparameter-tuned Random Forest with balanced class weights consistently outperforms baseline configurations, with BalancedRandomForest achieving the largest gains on the most severely imbalanced dataset (credit card fraud: 0.17% positive rate).

Compared 4 RF variants across credit card fraud, breast cancer, and heart disease datasets

BalancedRandomForest improved minority-class recall from 71% → 92% on fraud detection

Identified class_weight as the single most impactful hyperparameter for imbalanced classification

Full Python/Scikit-learn implementation with stratified 5-fold cross-validation


Motivation

Random Forest is one of the most deployed ensemble methods in industry — yet its behavior on imbalanced datasets is poorly understood by practitioners. Class imbalance is the norm, not the exception, in medical diagnostics and financial fraud detection. This thesis was motivated by a practical question: which RF variant, with which hyperparameters, performs best when the minority class matters most?

Research Questions

  • How does Random Forest performance degrade as class imbalance increases?
  • Which RF variant — standard, ExtraTrees, BalancedRF, or EasyEnsemble — best handles severe imbalance?
  • What is the marginal impact of individual hyperparameters on minority-class recall?
  • Do findings generalize across different domain datasets?

Datasets

  • Credit Card Fraud (Kaggle) — 284,807 transactions, 0.17% fraud rate. Extreme imbalance.
  • Breast Cancer Wisconsin — 569 samples, malignant vs benign. Moderate imbalance.
  • Heart Disease (UCI) — 303 samples, presence vs absence. Near-balanced.
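Of the three datasets, Breast Cancer Wisconsin ships with scikit-learn, which makes the class ratios easy to inspect directly. The following is a minimal sketch (not code from the thesis) showing why a stratified split matters on imbalanced data: it preserves the class ratio in both partitions, where a plain random split can distort it.

```python
from collections import Counter
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Breast Cancer Wisconsin: 569 samples, 212 malignant (0) vs 357 benign (1)
X, y = load_breast_cancer(return_X_y=True)
print(Counter(y))

# stratify=y preserves the malignant/benign ratio in train and test,
# so minority-class metrics on the test set remain meaningful
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(Counter(y_test))
```

The same pattern generalizes to the far more skewed fraud dataset, where an unstratified split could leave a fold with almost no positive cases.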

Methods

All experiments implemented in Python using Scikit-learn. The study compares four classifier families:

  • RandomForestClassifier — baseline and hyperparameter-tuned configurations
  • ExtraTreesClassifier — extra randomness in split selection, faster training
  • BalancedRandomForestClassifier — undersamples majority class at each bootstrap
  • EasyEnsembleClassifier — ensemble of AdaBoost on balanced subsamples
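The balanced-bootstrap idea behind BalancedRandomForestClassifier can be sketched with scikit-learn primitives alone: each tree trains on all minority samples plus an equal-sized random draw from the majority class. This is an illustrative reconstruction of the mechanism, not the imbalanced-learn implementation (which also bootstraps the minority class and handles feature subsampling internally).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (~10% minority class) for illustration
rng = np.random.default_rng(42)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)

minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)

# Each tree sees a balanced sample: all minority points plus an
# equal-sized bootstrap draw from the majority class
trees = []
for _ in range(25):
    maj_sample = rng.choice(majority, size=len(minority), replace=True)
    idx = np.concatenate([minority, maj_sample])
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Aggregate by majority vote across the balanced trees
votes = np.mean([t.predict(X) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
```

Because every tree sees the two classes in equal proportion, the ensemble is far less biased toward predicting the majority class than a standard bootstrap would be.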

Hyperparameters tuned via Grid Search with stratified 5-fold cross-validation: n_estimators (50–500), max_depth (None, 10, 20, 30), min_samples_split (2, 5, 10), class_weight ('balanced', None), max_features ('sqrt', 'log2', None).

Key Findings

  • On the credit card fraud dataset, BalancedRandomForest improved minority-class recall from 0.71 (baseline RF) to 0.92, at the cost of a precision decrease from 0.88 to 0.79
  • ExtraTreesClassifier provided 30–40% faster training than standard RF with equivalent or slightly better generalization on near-balanced datasets
  • The class_weight='balanced' parameter was the single most impactful hyperparameter for minority-class F1 on all three datasets
  • EasyEnsemble achieved the highest AUC-ROC across all datasets (0.98, 0.99, 0.97) but at 4× the inference time of standard RF
  • Hyperparameter tuning consistently yielded 5–15% improvement in minority-class F1 over default configurations
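The effect of class_weight='balanced' can be reproduced on synthetic data. The sketch below (illustrative, not from the thesis; dataset and variable names are placeholders) trains the same Random Forest with and without balanced class weights on a 5%-minority problem and compares minority-class recall:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with a 5% minority class
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05],
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency,
# penalizing missed minority cases more heavily during tree construction
for cw in [None, "balanced"]:
    rf = RandomForestClassifier(class_weight=cw, random_state=0).fit(X_tr, y_tr)
    rec = recall_score(y_te, rf.predict(X_te))
    print(f"class_weight={cw}: minority-class recall = {rec:.2f}")
```

The exact recall gap depends on the data, but the reweighting mechanism is what drives the minority-class F1 gains the thesis reports across all three datasets.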

Technical Implementation


from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import classification_report, roc_auc_score

# Search space for the baseline Random Forest
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'class_weight': ['balanced', None]
}

# Stratified folds preserve the class ratio in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# f1_macro averages per-class F1, giving the minority class equal weight
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring='f1_macro',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

# Evaluate the best configuration on the held-out test set
best_rf = grid_search.best_estimator_
y_pred = best_rf.predict(X_test)
print(classification_report(y_test, y_pred))
print(f"AUC-ROC: {roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1]):.4f}")

Conclusion

The thesis demonstrates that algorithm selection and hyperparameter tuning are more impactful than architecture complexity for tabular imbalanced classification. BalancedRandomForest is the recommended choice when minority-class recall is the primary objective; standard RF with class_weight='balanced' is the best default for general use. The study provides a reproducible experimental framework applicable to any imbalanced binary classification problem.

Random Forest · Ensemble Learning · Imbalanced Classification · Scikit-learn · Hyperparameter Tuning · AUC-ROC

