scikit-learn: RobustScaler for feature normalization (median + IQR, resistant to fraud outliers), RandomForestClassifier, stratified train/test splitting, and all evaluation metrics.
Business Problem
Credit card fraud accounts for billions of dollars in annual losses worldwide. The core challenge is not technical difficulty — it is data imbalance. Legitimate transactions vastly outnumber fraudulent ones (578:1 in this dataset), which means a naive model that predicts "everything is normal" achieves 99.83% accuracy while catching zero fraud.
The goal of this project is to build a model that reliably identifies the rare fraudulent transaction while minimizing false alarms that disrupt genuine customers. Every metric choice, preprocessing step, and modeling decision is driven by this real-world constraint.
Features V1–V28 are anonymized PCA components; Time, Amount, and Class retain their original meaning.
Pipeline Architecture & Tools
The pipeline is structured in six sequential stages: data loading and EDA, feature scaling, class-imbalance handling, model training, evaluation, and business-impact quantification.
XGBoost: gradient boosted trees with L1/L2 regularization and scale_pos_weight for native imbalance handling.
imbalanced-learn: Synthetic Minority Oversampling Technique (SMOTE), applied exclusively on the training set to avoid data leakage.
matplotlib + seaborn: publication-style charts with the seaborn-v0_8 theme, including Precision-Recall curves, confusion matrices, feature importance, and business-impact bars.
The Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix, precision_recall_curve,
    average_precision_score, roc_auc_score, f1_score
)
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
# Publication-style charts — seaborn-v0_8 provides a subtle grid
# and clean typography without the raw matplotlib defaults.
plt.style.use("seaborn-v0_8")
sns.set_context("talk", font_scale=0.9)
PALETTE = {"Normal": "#2ecc71", "Fraud": "#e74c3c"}
df = pd.read_csv("creditcard.csv")
class_counts = df["Class"].value_counts()
class_pct = df["Class"].value_counts(normalize=True) * 100
# Result: Normal = 284,315 (99.83%) | Fraud = 492 (0.17%)
# A naive classifier that predicts "Normal" every time achieves 99.83%
# accuracy — but catches ZERO fraud. This is why accuracy is useless
# here. We use AUPRC (Area Under the Precision-Recall Curve) instead,
# which focuses entirely on the minority class.
print(f"Class imbalance ratio: {class_counts[0] / class_counts[1]:.0f}:1")
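To make that concrete, here is a minimal sketch (synthetic labels at roughly the dataset's 578:1 ratio, not the real data) showing the all-"Normal" baseline scoring near-perfect accuracy while its AUPRC collapses to the fraud prevalence:

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Synthetic labels at roughly the dataset's 578:1 imbalance.
y_true = np.array([0] * 578 + [1])
naive_pred = np.zeros_like(y_true)    # predict "Normal" for everything
naive_scores = np.zeros(len(y_true))  # a constant score gives no ranking at all

print(f"Accuracy: {accuracy_score(y_true, naive_pred):.4f}")             # ~0.9983
print(f"AUPRC:    {average_precision_score(y_true, naive_scores):.4f}")  # ~0.0017
```

For a classifier with no ranking power, AUPRC equals the positive-class prevalence, which is exactly why it is the honest yardstick here.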
# Why RobustScaler instead of StandardScaler?
# StandardScaler computes mean +/- std. A single $25,000 fraud transaction
# skews both statistics heavily. RobustScaler uses median + IQR, which are
# resistant to extreme outliers — exactly what fraud data contains.
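A quick numeric sketch (toy amounts, not the real column) of how one extreme value drags StandardScaler's center while RobustScaler's barely registers it:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy transaction amounts: five typical purchases plus one $25,000 outlier.
amounts = np.array([[12.5], [40.0], [25.0], [60.0], [33.0], [25000.0]])

std = StandardScaler().fit(amounts)
rob = RobustScaler().fit(amounts)

print(f"StandardScaler centers on the mean:  {std.mean_[0]:,.2f}")   # ~4,195.08
print(f"RobustScaler centers on the median:  {rob.center_[0]:.2f}")  # 36.50
```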
scaler = RobustScaler()
# V1–V28 are already PCA-scaled. Only Amount and Time need normalization.
# One fit covers both columns (RobustScaler scales each independently),
# and the fitted scaler can still transform new data later.
df[["Amount_scaled", "Time_scaled"]] = scaler.fit_transform(df[["Amount", "Time"]])
df.drop(["Amount", "Time"], axis=1, inplace=True)
X = df.drop("Class", axis=1)
y = df["Class"]
# stratify=y ensures the 0.17% fraud rate is preserved in both
# train and test sets — without this, the test set could have no fraud.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
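A tiny self-contained check (toy labels, not the real data) that stratification preserves a rare positive rate on both sides of the split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([0] * 995 + [1] * 5)   # 0.5% positive rate
X_toy = np.arange(1000).reshape(-1, 1)

_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(y_tr_toy.mean(), y_te_toy.mean())  # both 0.005: the rate survives the split
```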
# Strategy A — Class Weights (built into the model):
# The simplest and most leak-proof approach. The model applies a higher
# penalty to misclassifying the minority class. No synthetic data generated.
rf_weighted = RandomForestClassifier(class_weight="balanced", n_estimators=200, random_state=42, n_jobs=-1)  # illustrative hyperparameters
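For reference, "balanced" sets each class's weight to n_samples / (n_classes * n_class_count); scikit-learn's utility makes that visible on toy counts:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)  # toy 9:1 imbalance
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# n_samples / (n_classes * count): 100 / (2 * 90) and 100 / (2 * 10)
print(dict(zip([0, 1], weights)))  # {0: 0.5555..., 1: 5.0}
```

The rare class ends up with a proportionally larger penalty, which is all class_weight="balanced" does under the hood.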
# Strategy B — SMOTE (Synthetic Minority Oversampling Technique):
# Generates synthetic fraud samples by interpolating between real neighbors.
# CRITICAL: SMOTE is applied ONLY on the training set. Applying it before
# splitting would leak information from the test set into training data.
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
# Before SMOTE: {0: 227,451 | 1: 394}
# After SMOTE: {0: 227,451 | 1: 227,451} — perfectly balanced
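The interpolation at SMOTE's core can be sketched in plain NumPy (simplified to one sample and one neighbor; the real algorithm picks randomly among the k nearest minority neighbors):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two real minority-class (fraud) samples in feature space.
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])

# A synthetic point somewhere on the segment between a minority sample
# and one of its nearest minority neighbors.
gap = rng.random()  # uniform in [0, 1)
synthetic = x + gap * (neighbor - x)
print(synthetic)    # lies between x and neighbor, component-wise
```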
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
scale_pos = neg_count / pos_count # ≈ 578
xgb_model = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    scale_pos_weight=scale_pos,  # XGBoost's equivalent of class_weight
    reg_alpha=0.1,               # L1 regularization
    reg_lambda=1.0,              # L2 regularization
    eval_metric="aucpr",         # optimize for AUPRC during training
    random_state=42,
    n_jobs=-1
)
def evaluate_model(model, X_tr, y_tr, X_te, y_te):
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_proba = model.predict_proba(X_te)[:, 1]
    # AUPRC is the gold standard for imbalanced datasets.
    # ROC-AUC can be misleadingly optimistic because the huge negative
    # class dilutes the false-positive rate. PR curves only look at
    # the minority class, which is exactly what we care about.
    return {
        "auprc": average_precision_score(y_te, y_proba),
        "roc_auc": roc_auc_score(y_te, y_proba),
        "f1": f1_score(y_te, y_pred),
        "cm": confusion_matrix(y_te, y_pred),
        "y_pred": y_pred,  # retained for the business-impact analysis
    }
# `results` maps each strategy name to its evaluation outputs ("y_pred", "cm");
# `amount_test` holds the raw (unscaled) Amount values for the test rows,
# captured before the scaling step.
COST_PER_FP = 10  # USD — cost of a false alarm (SMS, call center, temp block)
for name, res in results.items():
    tp_mask = (y_test.values == 1) & (res["y_pred"] == 1)
    fn_mask = (y_test.values == 1) & (res["y_pred"] == 0)
    money_saved = amount_test[tp_mask].sum()  # fraud correctly blocked
    money_lost = amount_test[fn_mask].sum()   # fraud that slipped through
    fp_cost = res["cm"][0, 1] * COST_PER_FP   # false positives x unit cost
    net_benefit = money_saved - fp_cost
    print(f"{name}: saved ${money_saved:,.2f} | lost ${money_lost:,.2f} | net ${net_benefit:,.2f}")
Visualizations
Four multi-panel charts walk through each stage of the analysis — from raw data exploration to model comparison and financial impact. Each figure is built with plt.subplots() and tight_layout() to keep all panels cleanly separated.
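A minimal sketch of that figure scaffolding (the panel titles here are illustrative, not the project's exact ones):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting; drop for interactive use
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(12, 9))
for ax, title in zip(axes.flat, ["Class Balance", "PR Curves",
                                 "Feature Importance", "Business Impact"]):
    ax.set_title(title)
fig.tight_layout()  # keeps the four panels cleanly separated
fig.savefig("panels.png", dpi=150)
```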
Conclusion
This project demonstrates that choosing the right evaluation metric is more important than choosing the right model. Using AUPRC instead of accuracy completely reframes what "good performance" means for imbalanced datasets — and leads to very different modeling decisions.
The business-impact analysis shows that the gap between a mediocre and a great model is not 0.5 AUPRC points on a leaderboard — it is thousands of dollars of unblocked fraud versus a handful of inconvenienced legitimate customers. That is the framing that gets data scientists a seat at the table.