Machine Learning Case Study

Credit Card Fraud Detection with Random Forest & XGBoost

An end-to-end ML pipeline that identifies fraudulent credit card transactions in a dataset of 284,807 records where only 0.17% are fraud. The project covers exploratory analysis, feature scaling, two class-imbalance strategies (class weights and SMOTE), model comparison, and a business-impact analysis that translates technical metrics into dollars saved.

Python scikit-learn XGBoost imbalanced-learn SMOTE Matplotlib / Seaborn AUPRC

Business Problem

Credit card fraud accounts for billions of dollars in annual losses worldwide. The core challenge is not technical difficulty — it is data imbalance. Legitimate transactions vastly outnumber fraudulent ones (578:1 in this dataset), which means a naive model that predicts "everything is normal" achieves 99.83% accuracy while catching zero fraud.
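The accuracy trap above is easy to verify with a toy baseline built from the dataset's published class counts:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Rebuild the dataset's class balance: 284,315 normal, 492 fraud.
y_true  = np.array([0] * 284_315 + [1] * 492)
y_naive = np.zeros_like(y_true)          # "everything is normal"

acc    = accuracy_score(y_true, y_naive)
recall = recall_score(y_true, y_naive)

print(f"Naive accuracy: {acc:.2%}  |  fraud recall: {recall:.0%}")
# → Naive accuracy: 99.83%  |  fraud recall: 0%
```

High accuracy, zero fraud caught: this is why the project evaluates on AUPRC instead.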

The goal of this project is to build a model that reliably identifies the rare fraudulent transaction while minimizing false alarms that disrupt genuine customers. Every metric choice, preprocessing step, and modeling decision is driven by this real-world constraint.

The dataset is the public Kaggle Credit Card Fraud Detection dataset (ULB). Features V1–V28 are the result of a PCA transformation applied to protect cardholder privacy. Only Time, Amount, and Class retain their original meaning.

Pipeline Architecture & Tools

The pipeline is structured in six sequential stages: data loading and EDA, feature scaling, class-imbalance handling, model training, evaluation, and business-impact quantification.
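The project runs these stages as separate notebook cells, but the scaling and training stages can be sketched as a single scikit-learn `Pipeline` (shown here on synthetic stand-in data, since the real project loads `creditcard.csv`):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Stand-in data with a ~2% minority class (far milder than the real 578:1).
X, y = make_classification(n_samples=2_000, weights=[0.98], random_state=42)

pipe = Pipeline([
    ("scale", RobustScaler()),                                 # stage 2: feature scaling
    ("model", RandomForestClassifier(class_weight="balanced",  # stage 3: imbalance handling
                                     random_state=42)),        # stage 4: model training
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
pipe.fit(X_tr, y_tr)
proba = pipe.predict_proba(X_te)[:, 1]   # stage 5 scores these with AUPRC
```

Bundling the steps this way also guarantees the scaler is fit only on training folds during any cross-validation.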

scikit-learn

RobustScaler for feature normalization (median+IQR, resistant to fraud outliers), RandomForestClassifier, train/test split with stratification, and all evaluation metrics.

XGBoost

Gradient boosted trees with L1/L2 regularization and scale_pos_weight for native imbalance handling.

imbalanced-learn (SMOTE)

Synthetic Minority Oversampling Technique applied exclusively on the training set to avoid data leakage.

Matplotlib / Seaborn

Publication-style charts with the seaborn-v0_8 theme: Precision-Recall curves, confusion matrices, feature importance, and business impact bars.

The Code

Imports & aesthetic configuration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, precision_recall_curve,
    average_precision_score, roc_auc_score, f1_score
)
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE

# Publication-style charts — seaborn-v0_8 provides a subtle grid
# and clean typography without the raw matplotlib defaults.
plt.style.use("seaborn-v0_8")
sns.set_context("talk", font_scale=0.9)
PALETTE = {"Normal": "#2ecc71", "Fraud": "#e74c3c"}
Exploratory Data Analysis — class imbalance check
df = pd.read_csv("creditcard.csv")

class_counts = df["Class"].value_counts()
class_pct    = df["Class"].value_counts(normalize=True) * 100

# Result: Normal = 284,315 (99.83%)  |  Fraud = 492 (0.17%)
# A naive classifier that predicts "Normal" every time achieves 99.83%
# accuracy — but catches ZERO fraud. This is why accuracy is useless
# here. We use AUPRC (Area Under the Precision-Recall Curve) instead,
# which focuses entirely on the minority class.
print(f"Class imbalance ratio: {class_counts[0] / class_counts[1]:.0f}:1")
Preprocessing — feature scaling & stratified split
# Why RobustScaler instead of StandardScaler?
# StandardScaler standardizes with the mean and standard deviation. A single
# $25,000 fraud transaction skews both statistics heavily. RobustScaler uses
# the median and IQR, which are resistant to extreme outliers — exactly what
# fraud data contains.
scaler = RobustScaler()

# V1–V28 are already PCA-scaled. Only Amount and Time need normalization.
# RobustScaler scales each column independently, so one fit covers both.
df[["Amount_scaled", "Time_scaled"]] = scaler.fit_transform(df[["Amount", "Time"]])
df.drop(["Amount", "Time"], axis=1, inplace=True)

X = df.drop("Class", axis=1)
y = df["Class"]

# stratify=y ensures the 0.17% fraud rate is preserved in both
# train and test sets — without this, the test set could have no fraud.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
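The outlier resistance claimed in the comment above can be demonstrated on a toy amount column (illustrative values, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Nine everyday amounts plus one extreme $25,000 fraud.
amounts = np.array([20.0, 35.0, 12.0, 50.0, 8.0,
                    42.0, 27.0, 15.0, 33.0, 25_000.0]).reshape(-1, 1)

std_scaled    = StandardScaler().fit_transform(amounts)
robust_scaled = RobustScaler().fit_transform(amounts)

# With StandardScaler the outlier inflates the std, crushing the nine normal
# amounts into a sliver of the scale. RobustScaler's median/IQR statistics
# barely notice the outlier, so the normal amounts keep a usable spread.
print("normal-row spread, StandardScaler:", float(np.ptp(std_scaled[:9])))
print("normal-row spread, RobustScaler:  ", float(np.ptp(robust_scaled[:9])))
```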
Class-imbalance strategies — Class Weights vs. SMOTE
# Strategy A — Class Weights (built into the model):
# The simplest and most leak-proof approach. The model applies a higher
# penalty to misclassifying the minority class. No synthetic data generated.
rf_weighted = RandomForestClassifier(
    class_weight="balanced", random_state=42, n_jobs=-1
)

# Strategy B — SMOTE (Synthetic Minority Oversampling Technique):
# Generates synthetic fraud samples by interpolating between real neighbors.
# CRITICAL: SMOTE is applied ONLY on the training set. Applying it before
# splitting would leak information from the test set into training data.
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Before SMOTE: {0: 227,451  |  1: 394}
# After  SMOTE: {0: 227,451  |  1: 227,451}  — perfectly balanced
Model training & evaluation function
neg_count  = (y_train == 0).sum()
pos_count  = (y_train == 1).sum()
scale_pos  = neg_count / pos_count   # ≈ 578

xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos,   # XGBoost's equivalent of class_weight
    reg_alpha=0.1,                # L1 regularization
    reg_lambda=1.0,               # L2 regularization
    eval_metric="aucpr",          # optimize for AUPRC during training
    random_state=42,
    n_jobs=-1
)

def evaluate_model(model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    y_pred  = model.predict(X_te)
    y_proba = model.predict_proba(X_te)[:, 1]

    # AUPRC is the gold standard for imbalanced datasets.
    # ROC-AUC can be misleadingly optimistic because the huge negative
    # class dilutes the false-positive rate. PR curves only look at
    # the minority class, which is exactly what we care about.
    auprc = average_precision_score(y_te, y_proba)
    roc   = roc_auc_score(y_te, y_proba)
    f1    = f1_score(y_te, y_pred)
    cm    = confusion_matrix(y_te, y_pred)
    return auprc, roc, f1, cm
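The comment about ROC-AUC optimism is easy to reproduce: for a scorer with no signal at all, ROC-AUC hovers near a respectable-looking 0.5 while AUPRC collapses to the fraud prevalence (a synthetic sketch at roughly the dataset's 0.17% fraud rate):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)

# 100,000 synthetic transactions at ~0.17% fraud, scored by a model
# with no real signal (uniform random scores).
y_true = (rng.random(100_000) < 0.0017).astype(int)
scores = rng.random(100_000)

roc   = roc_auc_score(y_true, scores)            # near 0.5 — looks tolerable
auprc = average_precision_score(y_true, scores)  # near 0.0017 — honest

print(f"ROC-AUC of a no-skill scorer: {roc:.3f}")
print(f"AUPRC  of a no-skill scorer: {auprc:.4f}")
```

A model's AUPRC should therefore be judged against the ~0.0017 no-skill baseline, not against 0.5.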
Business impact — translating metrics into dollars
COST_PER_FP = 10  # USD — cost of a false alarm (SMS, call center, temp block)

# `results` maps each model name to its test-set predictions (`y_pred`) and
# confusion matrix (`cm`); `amount_test` holds the unscaled dollar amounts
# of the held-out transactions.
for name, res in results.items():
    tp_mask    = (y_test.values == 1) & (res["y_pred"] == 1)
    fn_mask    = (y_test.values == 1) & (res["y_pred"] == 0)

    money_saved = amount_test[tp_mask].sum()   # fraud correctly blocked
    money_lost  = amount_test[fn_mask].sum()   # fraud that slipped through
    fp_cost     = res["cm"][0, 1] * COST_PER_FP

    net_benefit = money_saved - fp_cost
    print(f"{name}: saved ${money_saved:,.2f} | lost ${money_lost:,.2f} | net ${net_benefit:,.2f}")

Visualizations

Four multi-panel charts walk through each stage of the analysis — from raw data exploration to model comparison and financial impact. Each figure is built with plt.subplots() and tight_layout() to keep all panels cleanly separated.
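A single panel from the model-performance figure can be sketched like this (synthetic scores stand in for real model output; the project plots four models per axis):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                       # headless rendering
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

rng = np.random.default_rng(0)

# Synthetic test-set scores: the 10 fraud cases score higher on average.
y_true  = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])
y_score = np.concatenate([rng.beta(2, 5, size=990), rng.beta(5, 2, size=10)])

precision, recall, _ = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(recall, precision, label=f"model (AUPRC = {ap:.3f})")
ax.axhline(y_true.mean(), ls="--", color="gray",
           label=f"no-skill baseline ({y_true.mean():.3f})")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.legend()
fig.tight_layout()
```

Plotting the fraud prevalence as a dashed baseline makes each model's lift over "no skill" visible at a glance.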

Exploratory Data Analysis. Class imbalance (578:1), transaction amount on log scale revealing fraud outliers, temporal patterns showing fraud spikes at specific hours, and the top 12 PCA features most correlated with fraud.
Model Performance. Precision-Recall curves for all 4 models (XGBoost Weights leads at AUPRC 0.876), grouped bar chart comparing AUPRC / ROC AUC / F1, and confusion matrices for the best and runner-up models showing TP/FP/FN/TN counts.
Feature Importance. Both Random Forest and XGBoost rank V14 as the dominant fraud signal by a wide margin, followed by V4 and V10. Agreement between two different algorithms increases confidence that these PCA components capture genuine transaction anomalies.
Business Impact. XGBoost (Weights) blocks $8,523 in fraud on the test set while missing only $2,122. After subtracting a conservative $10-per-false-alarm cost (only 13 false alarms), the net benefit is $8,393 — a strong case for production deployment.

Conclusion

This project demonstrates that choosing the right evaluation metric is more important than choosing the right model. Using AUPRC instead of accuracy completely reframes what "good performance" means for imbalanced datasets — and leads to very different modeling decisions.

The business-impact analysis shows that the gap between a mediocre and a great model is not 0.5 AUPRC points on a leaderboard — it is thousands of dollars of unblocked fraud versus a handful of inconvenienced legitimate customers. That is the framing that gets data scientists a seat at the table.