Machine Learning Case Study — Kaggle Competition

Food Delivery Time Prediction with Random Forest & XGBoost

An end-to-end regression pipeline that predicts food delivery time (in minutes) from 45,593 real-world orders. The project covers exploratory analysis, geographic feature engineering (haversine distance), a scikit-learn preprocessing pipeline with imputation, a seven-model benchmark (Ridge through a Stacking Ensemble), and a business-impact analysis framed around 30-minute SLA compliance.

Python scikit-learn XGBoost LightGBM CatBoost Stacking Ensemble Haversine Distance Matplotlib / Seaborn RMSE / R²

Business Problem

In the competitive food delivery market, delivery time is the single metric customers care about most. A platform that consistently overpromises delivery windows loses customer trust and incurs chargeback costs. One that always pads estimates to be safe drives customers to competitors with shorter quoted times.

The goal is to build a model that predicts delivery time accurately enough to support a 30-minute SLA promise — knowing in advance which orders are on track and which need proactive communication. Beyond RMSE, every modeling decision is evaluated against that operational constraint.

Dataset: Kaggle “Food Delivery Time Prediction” competition. 45,593 training orders across Indian cities, with features covering delivery personnel, weather, traffic, vehicle type, order category, and GPS coordinates for both the restaurant and the drop-off location.

Pipeline Architecture & Tools

The pipeline is structured in five sequential stages: data cleaning, feature engineering, preprocessing, model training, and business-impact evaluation.

Feature Engineering

Three derived features drive most of the predictive signal: distance_km (haversine straight-line distance from restaurant to customer), pickup_wait (minutes between order placed and rider pick-up), and order_hour (extracted from the order timestamp to capture peak-hour effects).

scikit-learn Pipelines

A ColumnTransformer applies SimpleImputer + StandardScaler to numeric columns and SimpleImputer + OneHotEncoder to categoricals — all within a single Pipeline to prevent data leakage. CatBoost receives raw string categoricals and handles them natively; HistGBRT passes NaN through without any imputer.

Seven-Model Benchmark

Ridge (baseline), Random Forest, XGBoost, HistGradientBoostingRegressor, LightGBM, and CatBoost are evaluated head-to-head as the six base models; the Stacking Ensemble below completes the seven. XGBoost, HistGBRT, LightGBM, and CatBoost cluster tightly at RMSE 3.97–3.98 min (R² 0.819–0.820), showing diminishing per-model returns — the key signal to justify a stacking approach.

Stacking Ensemble

A StackingRegressor blends Random Forest, XGBoost, and LightGBM via 5-fold cross-validation, with a Ridge meta-learner combining their out-of-fold predictions. This produces the best result: RMSE 3.90 min, R² 0.826 — a 37% improvement over Ridge and a meaningful step past any individual ensemble model.

SLA Business Analysis

Model predictions are mapped to a binary “on-time vs. at-risk” classification at the 30-minute SLA threshold, then evaluated using precision, recall, and accuracy across all 8 entries (including a naive mean baseline) — metrics the operations team can act on directly.

The Code

Imports & aesthetic configuration
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor

plt.style.use("seaborn-v0_8")
sns.set_context("talk", font_scale=0.88)
Feature engineering — distance, pickup wait, and order hour
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS coordinates (in km)."""
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return R * 2 * np.arcsin(np.sqrt(a))

# Straight-line distance from restaurant to drop-off — the strongest
# individual predictor (more distance → more time, on avg 1.4 min/km).
df["distance_km"] = haversine_km(
    df["Restaurant_latitude"],  df["Restaurant_longitude"],
    df["Delivery_location_latitude"], df["Delivery_location_longitude"],
)

# Pickup wait: minutes from order placed to rider departing.
# Long waits indicate kitchen backlog and predict longer end-to-end times.
def to_minutes(t):
    h, m, s = t.split(":")
    return int(h) * 60 + int(m) + int(s) / 60

df["ordered_min"] = df["Time_Orderd"].apply(to_minutes)
df["picked_min"]  = df["Time_Order_picked"].apply(to_minutes)
df["pickup_wait"] = (df["picked_min"] - df["ordered_min"]).clip(lower=0)

# Order hour: captures peak demand (lunch 12–14, dinner 19–21)
# when traffic is heaviest and delivery times are longest.
df["order_hour"] = df["ordered_min"].apply(lambda x: int(x // 60) % 24)
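A quick sanity check on the distance feature (not part of the original notebook): one degree of longitude along the equator should come out near 111.2 km. The function above is repeated so the snippet runs standalone.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS coordinates (in km)."""
    R = 6371.0  # mean Earth radius
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return R * 2 * np.arcsin(np.sqrt(a))

# One degree of longitude at the equator: 6371 * pi / 180 ≈ 111.19 km
print(round(float(haversine_km(0.0, 0.0, 0.0, 1.0)), 2))   # 111.19
```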
Preprocessing — imputation, scaling, and encoding inside a Pipeline
NUM_FEATURES = [
    "Delivery_person_Age", "Delivery_person_Ratings",
    "Vehicle_condition", "multiple_deliveries",
    "distance_km", "pickup_wait", "order_hour",
]
CAT_FEATURES = [
    "Weatherconditions", "Road_traffic_density",
    "Type_of_order", "Type_of_vehicle", "Festival", "City",
]

# Why use a Pipeline?
# 1. All transformers are fit on training data only — no leakage.
# 2. The whole chain (impute → scale/encode → model) can be pickled
#    and served as a single artifact in production.
# 3. Cross-validation and hyperparameter tuning work correctly
#    because the preprocessor refits on each fold automatically.

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),   # handles NaN ages/ratings
    ("scaler",  StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # handles NaN city/weather
    ("ohe",     OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, NUM_FEATURES),
    ("cat", cat_pipe, CAT_FEATURES),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[NUM_FEATURES + CAT_FEATURES],
    df["Time_taken_min"],
    test_size=0.2,
    random_state=42,
)
Model training — seven models including HistGBRT, LightGBM, CatBoost
from sklearn.ensemble import HistGradientBoostingRegressor, StackingRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

models = {}   # benchmark registry: model name -> pipeline

# HistGBRT: scikit-learn's native fast GBDT — handles NaN without imputation
# and uses histograms for fast binning (similar to LightGBM internally).
models["HistGBRT"] = Pipeline([
    ("prep",  preprocessor_hist),   # passthrough numerics, OHE categoricals
    ("model", HistGradientBoostingRegressor(
        max_iter=400, max_depth=6, learning_rate=0.05,
        l2_regularization=0.1, random_state=42,
    )),
])

# LightGBM: histogram GBDT from Microsoft — typically fastest to train,
# competitive with XGBoost, and exposes a scikit-learn compatible API.
models["LightGBM"] = Pipeline([
    ("prep",  preprocessor),
    ("model", LGBMRegressor(
        n_estimators=400, max_depth=6, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8,
        reg_alpha=0.1, reg_lambda=1.0,
        n_jobs=-1, random_state=42, verbosity=-1,
    )),
])

# CatBoost: passes categorical columns as raw strings, converting them
# internally via target statistics — no OHE required, often strongest
# on data with many high-cardinality categoricals.
cb_model = CatBoostRegressor(
    iterations=400, depth=6, learning_rate=0.05,
    l2_leaf_reg=3, random_seed=42, verbose=0,
    cat_features=cat_feature_indices,   # raw column positions
)
cb_model.fit(X_train_cb, y_train)   # X_train_cb keeps cats as str dtype

# Results after all 6 base models:
# Ridge (baseline):  RMSE=6.21  R²=0.560
# Random Forest:     RMSE=3.94  R²=0.823
# XGBoost:           RMSE=3.98  R²=0.819
# HistGBRT:          RMSE=3.98  R²=0.820
# LightGBM:          RMSE=3.98  R²=0.819
# CatBoost:          RMSE=3.97  R²=0.820
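The Ridge, Random Forest, and XGBoost pipelines that produce the results above (and the rf_pipeline / xgb_pipeline names the stacking block reuses) are not shown in this excerpt. Below is a minimal, self-contained sketch of the register-then-score pattern on synthetic stand-in data; the column names, hyperparameters, and target formula here are assumptions for illustration, not the notebook's tuned values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the delivery data (columns are illustrative).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "distance_km": rng.uniform(1, 20, n),
    "pickup_wait": rng.uniform(0, 15, n),
    "Road_traffic_density": rng.choice(["Low", "Jam"], n),
})
y = (10 + 1.4 * df["distance_km"] + 0.8 * df["pickup_wait"]
     + np.where(df["Road_traffic_density"] == "Jam", 8.0, 0.0)
     + rng.normal(0, 2, n))

preprocessor = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc", StandardScaler())]),
     ["distance_km", "pickup_wait"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Road_traffic_density"]),
])

models = {}
models["Ridge"] = Pipeline([("prep", preprocessor),
                            ("model", Ridge(alpha=1.0))])
rf_pipeline = Pipeline([("prep", preprocessor),
                        ("model", RandomForestRegressor(n_estimators=100,
                                                        random_state=42))])
models["Random Forest"] = rf_pipeline   # XGBoost / LightGBM follow the same pattern

# One scoring loop covers every scikit-learn-compatible entry.
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2,
                                                    random_state=42)
results = {}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {"rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
                     "r2": float(r2_score(y_test, pred))}
    print(f"{name:<14} RMSE={results[name]['rmse']:.2f}  "
          f"R²={results[name]['r2']:.3f}")
```

Sharing one preprocessor instance across pipelines mirrors the excerpt; each `fit` simply refits it on the same training split.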
Stacking ensemble — blending RF, XGBoost, LightGBM with a Ridge meta-learner
# Why stack?
# All five tree-ensemble models cluster at R² 0.819–0.823 — the individual
# models have hit a ceiling on this data. A stacking ensemble exploits
# their different error patterns to push past what any single model
# achieves. Each base learner's out-of-fold predictions become the
# meta-learner's feature matrix, preventing leakage.

stacking = StackingRegressor(
    estimators=[
        ("rf",  rf_pipeline),
        ("xgb", xgb_pipeline),
        ("lgb", lgb_pipeline),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,          # 5-fold CV for out-of-fold meta-features
    n_jobs=-1,
)
stacking.fit(X_train, y_train)

# Stacking Ensemble:  RMSE=3.90  MAE=3.13  R²=0.826  ← new best
# Improvement over best single model (RF): +0.003 R²  /  -0.04 RMSE
# SLA accuracy: 93.5%  |  precision: 94.4%  |  recall: 96.3%
Business impact — 30-minute SLA compliance across all 8 entries
SLA = 30  # minutes — the promise shown to customers at checkout

def sla_metrics(y_true, y_pred, sla=SLA):
    """Treats delivery prediction as a binary 'on-time / at-risk' classifier."""
    actual_yes = y_true <= sla
    pred_yes   = y_pred <= sla
    tp = (actual_yes &  pred_yes).sum()
    fp = (~actual_yes &  pred_yes).sum()
    fn = (actual_yes & ~pred_yes).sum()
    tn = (~actual_yes & ~pred_yes).sum()
    return {
        "precision": tp / (tp + fp),   # of predicted on-time, how many were?
        "recall":    tp / (tp + fn),   # of truly on-time, how many did we catch?
        "accuracy":  (tp + tn) / len(y_true),
    }

# Naive model: always predicts the training-set mean (26.3 min)
naive_pred = np.full_like(y_test.values, y_train.mean(), dtype=float)

# Stacking Ensemble SLA accuracy:  93.5%  |  precision: 94.4%  |  recall: 96.3%
# Naive baseline SLA accuracy:      70.2%  — >23pp gap attributable to the model
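The excerpt defines sla_metrics but never shows a call. Here is a standalone usage sketch with toy numbers (not the project's data), repeating the function so the snippet runs on its own.

```python
import numpy as np

SLA = 30  # minutes

def sla_metrics(y_true, y_pred, sla=SLA):
    """Treats delivery prediction as a binary 'on-time / at-risk' classifier."""
    actual_yes = y_true <= sla
    pred_yes = y_pred <= sla
    tp = (actual_yes & pred_yes).sum()
    fp = (~actual_yes & pred_yes).sum()
    fn = (actual_yes & ~pred_yes).sum()
    tn = (~actual_yes & ~pred_yes).sum()
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

# Four toy orders: one false "at-risk" call and one broken on-time promise.
y_true = np.array([25, 28, 35, 40])   # actual minutes
y_pred = np.array([24, 33, 29, 41])   # predicted minutes
m = sla_metrics(y_true, y_pred)
print(m)   # precision, recall, and accuracy are all 0.5 on this tiny example
```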

Visualizations

Four multi-panel charts walk through each stage of the analysis — from raw data exploration through model comparison, feature importance, and SLA business impact. Each figure is built with plt.subplots() on a dark background to match the portfolio’s visual language.
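A minimal sketch of that multi-panel scaffold, assuming a stock dark style as a stand-in for the portfolio theme; the panel titles are drawn from the figure descriptions below, and the plotted data is placeholder noise rather than the analysis results.

```python
import matplotlib
matplotlib.use("Agg")              # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

plt.style.use("dark_background")   # stand-in for the portfolio's dark theme

# One 2x2 multi-panel figure, built with plt.subplots() as in the project.
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("Model Performance Overview")

rng = np.random.default_rng(0)
titles = ["RMSE by model", "R² by model",
          "Residual distribution", "Predicted vs. actual"]
for ax, title in zip(axes.flat, titles):
    ax.plot(rng.normal(size=50))   # placeholder series
    ax.set_title(title)

fig.tight_layout()
fig.savefig("model_performance.png", dpi=150)
plt.close(fig)
```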

Exploratory Data Analysis. Delivery time is right-skewed (mean 26.3 min, max 54 min). Traffic “Jam” adds ~10 minutes over “Low” density. Sandstorms and stormy weather are the biggest weather penalties. Distance shows a clear positive correlation (slope ~1.4 min/km) — the foundation for haversine feature engineering.
Model Performance. The Stacking Ensemble (RMSE 3.90 min, R² 0.826) edges past all six base models. XGBoost, HistGBRT, LightGBM, and CatBoost cluster tightly at R² 0.819–0.820 — the diminishing-returns pattern that motivates stacking. Every tree-ensemble model cuts Ridge's error by roughly 36–37%. The residual distribution is tightly centered at zero with no systematic bias.
Feature Importance. All three models highlight driver ratings and multiple-delivery flag as top numeric signals. Low-traffic and sunny-weather indicators dominate in LightGBM and CatBoost because they capture when fast deliveries are possible. distance_km consistently ranks in the top 4 across all three algorithms — validating the haversine engineering step.
Business Impact. The Stacking Ensemble achieves the best SLA accuracy at 93.5%, a 23-percentage-point improvement over the naive mean-prediction baseline (70.2%). Precision of 94.4% means fewer than 6% of “on-time” promises will be broken, and recall of 96.3% means almost no genuinely fast orders are mislabeled as “at risk.” All 8 entries including the naive baseline are shown for full transparency.

Conclusion

The biggest lesson from this project is that raw data rarely contains the right representation. The dataset ships with GPS coordinates — but latitude and longitude alone carry almost no predictive signal when treated as plain numerics. Converting them to a haversine distance produced a top-4 feature across all three importance charts.

The second lesson is about model ceilings. When XGBoost, HistGBRT, LightGBM, and CatBoost all converge to within 0.001 R² of each other, adding another base model will not help — but a stacking ensemble that blends their out-of-fold predictions can still extract additional signal by exploiting different error patterns among the learners.

The SLA framing also changes the conversation with stakeholders. A standalone RMSE of 3.90 minutes is hard to contextualize; “93.5% of our 30-minute SLA promises are correct — up 23 points over guessing the average” is a number an operations team can build a customer-communication strategy around.