Three derived features drive most of the predictive signal: distance_km (haversine straight-line distance from restaurant to customer), pickup_wait (minutes between order placed and rider pick-up), and order_hour (extracted from the order timestamp to capture peak-hour effects).
Business Problem
In the competitive food delivery market, delivery time is the single metric customers care about most. A platform that consistently overpromises delivery windows loses customer trust and incurs chargeback costs. One that always pads estimates to be safe drives customers to competitors with shorter quoted times.
The goal is to build a model that predicts delivery time accurately enough to support a 30-minute SLA promise — knowing in advance which orders are on track and which need proactive communication. Beyond RMSE, every modeling decision is evaluated against that operational constraint.
Pipeline Architecture & Tools
The pipeline is structured in five sequential stages: data cleaning, feature engineering, preprocessing, model training, and business-impact evaluation.
A ColumnTransformer applies SimpleImputer + StandardScaler to numeric columns and SimpleImputer + OneHotEncoder to categoricals — all within a single Pipeline to prevent data leakage. CatBoost receives raw string categoricals and handles them natively; HistGBRT passes NaN through without any imputer.
Ridge (baseline), Random Forest, XGBoost, HistGradientBoostingRegressor, LightGBM, and CatBoost are all evaluated head-to-head. XGBoost, HistGBRT, LightGBM, and CatBoost cluster tightly at RMSE 3.97–3.98 min (R² 0.819–0.820), showing diminishing per-model returns — the key signal to justify a stacking approach.
A StackingRegressor blends Random Forest, XGBoost, and LightGBM via 5-fold cross-validation, with a Ridge meta-learner combining their out-of-fold predictions. This produces the best result: RMSE 3.90 min, R² 0.826 — a 37% improvement over Ridge and a meaningful step past any individual ensemble model.
Model predictions are mapped to a binary “on-time vs. at-risk” classification at the 30-minute SLA threshold, then scored with precision, recall, and accuracy for all eight entries in the comparison (the six base models, the stacking ensemble, and a naive mean baseline): metrics the operations team can act on directly.
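Concretely, the head-to-head comparison reduces to one fit-and-score loop per model. A minimal, self-contained sketch on synthetic data (two stand-in models instead of the project's six pipelines; names and sizes are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in data: the real project uses the delivery DataFrame and full Pipelines.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                           # fit on the training split only
    pred = model.predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))  # RMSE in target units
    r2 = r2_score(y_te, pred)
    results[name] = (rmse, r2)
    print(f"{name:15s} RMSE={rmse:6.2f}  R²={r2:.3f}")
```

Every model goes through the same split and the same metrics, so the comparison table falls directly out of `results`.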
The Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from xgboost import XGBRegressor
plt.style.use("seaborn-v0_8")
sns.set_context("talk", font_scale=0.88)
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS coordinates (in km)."""
    R = 6371.0  # mean Earth radius
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return R * 2 * np.arcsin(np.sqrt(a))
# Straight-line distance from restaurant to drop-off — the strongest
# individual predictor (more distance → more time, on avg 1.4 min/km).
df["distance_km"] = haversine_km(
    df["Restaurant_latitude"], df["Restaurant_longitude"],
    df["Delivery_location_latitude"], df["Delivery_location_longitude"],
)
# Pickup wait: minutes from order placed to rider departing.
# Long waits indicate kitchen backlog and predict longer end-to-end times.
def to_minutes(t):
    """HH:MM:SS → minutes since midnight."""
    if pd.isna(t):
        return np.nan  # propagate missing timestamps as NaN
    h, m, s = str(t).split(":")
    return int(h) * 60 + int(m) + int(s) / 60
df["ordered_min"] = df["Time_Orderd"].apply(to_minutes)
df["picked_min"] = df["Time_Order_picked"].apply(to_minutes)
df["pickup_wait"] = (df["picked_min"] - df["ordered_min"]).clip(lower=0)
# Order hour: captures peak demand (lunch 12–14, dinner 19–21)
# when traffic is heaviest and delivery times are longest.
df["order_hour"] = df["ordered_min"] // 60 % 24  # vectorized and NaN-safe
NUM_FEATURES = [
"Delivery_person_Age", "Delivery_person_Ratings",
"Vehicle_condition", "multiple_deliveries",
"distance_km", "pickup_wait", "order_hour",
]
CAT_FEATURES = [
"Weatherconditions", "Road_traffic_density",
"Type_of_order", "Type_of_vehicle", "Festival", "City",
]
# Why use a Pipeline?
# 1. All transformers are fit on training data only — no leakage.
# 2. The whole chain (impute → scale/encode → model) can be pickled
# and served as a single artifact in production.
# 3. Cross-validation and hyperparameter tuning work correctly
# because the preprocessor refits on each fold automatically.
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # handles NaN ages/ratings
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),  # handles NaN city/weather
    ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ("num", num_pipe, NUM_FEATURES),
    ("cat", cat_pipe, CAT_FEATURES),
])
X_train, X_test, y_train, y_test = train_test_split(
    df[NUM_FEATURES + CAT_FEATURES],
    df["Time_taken_min"],
    test_size=0.2,
    random_state=42,
)
from sklearn.ensemble import HistGradientBoostingRegressor, StackingRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
# HistGBRT: scikit-learn's native fast GBDT — handles NaN without imputation
# and uses histograms for fast binning (similar to LightGBM internally).
preprocessor_hist = ColumnTransformer([
    ("num", "passthrough", NUM_FEATURES),  # no imputer: NaN flows to the model
    ("cat", cat_pipe, CAT_FEATURES),
])
models["HistGBRT"] = Pipeline([
    ("prep", preprocessor_hist),  # passthrough numerics, OHE categoricals
    ("model", HistGradientBoostingRegressor(
        max_iter=400, max_depth=6, learning_rate=0.05,
        l2_regularization=0.1, random_state=42,
    )),
])
# LightGBM: histogram GBDT from Microsoft — typically fastest to train,
# competitive with XGBoost, and exposes a scikit-learn compatible API.
models["LightGBM"] = Pipeline([
    ("prep", preprocessor),
    ("model", LGBMRegressor(
        n_estimators=400, max_depth=6, learning_rate=0.05,
        subsample=0.8, colsample_bytree=0.8,
        reg_alpha=0.1, reg_lambda=1.0,
        n_jobs=-1, random_state=42, verbosity=-1,
    )),
])
# CatBoost: passes categorical columns as raw strings, converting them
# internally via target statistics — no OHE required, often strongest
# on data with many high-cardinality categoricals.
X_train_cb = X_train.copy()  # keep categoricals as raw strings for CatBoost
X_train_cb[CAT_FEATURES] = X_train_cb[CAT_FEATURES].fillna("missing").astype(str)
cat_feature_indices = [X_train_cb.columns.get_loc(c) for c in CAT_FEATURES]
cb_model = CatBoostRegressor(
    iterations=400, depth=6, learning_rate=0.05,
    l2_leaf_reg=3, random_seed=42, verbose=0,
    cat_features=cat_feature_indices,  # raw column positions
)
cb_model.fit(X_train_cb, y_train)  # X_train_cb keeps cats as str dtype
# Results after all 6 base models:
# Ridge (baseline): RMSE=6.21 R²=0.560
# Random Forest: RMSE=3.94 R²=0.823
# XGBoost: RMSE=3.98 R²=0.819
# HistGBRT: RMSE=3.98 R²=0.820
# LightGBM: RMSE=3.98 R²=0.819
# CatBoost: RMSE=3.97 R²=0.820
# Why stack?
# All five tree-ensemble models cluster at R² 0.819-0.823 — the individual
# models have hit a ceiling on this data. A stacking ensemble exploits
# their different error patterns to push past what any single model
# achieves. Each base learner's out-of-fold predictions become the
# meta-learner's feature matrix, preventing leakage.
# rf_pipeline / xgb_pipeline / lgb_pipeline are the preprocessor+model
# Pipelines assembled above for the head-to-head comparison.
stacking = StackingRegressor(
    estimators=[
        ("rf", rf_pipeline),
        ("xgb", xgb_pipeline),
        ("lgb", lgb_pipeline),
    ],
    final_estimator=Ridge(alpha=1.0),
    cv=5,  # 5-fold CV for out-of-fold meta-features
    n_jobs=-1,
)
stacking.fit(X_train, y_train)
# Stacking Ensemble: RMSE=3.90 MAE=3.13 R²=0.826 ← new best
# Improvement over best single model (RF): +0.003 R² / -0.04 RMSE
# SLA accuracy: 93.5% | precision: 94.4% | recall: 96.3%
SLA = 30 # minutes — the promise shown to customers at checkout
def sla_metrics(y_true, y_pred, sla=SLA):
    """Treat delivery prediction as a binary 'on-time / at-risk' classifier."""
    actual_yes = y_true <= sla
    pred_yes = y_pred <= sla
    tp = (actual_yes & pred_yes).sum()
    fp = (~actual_yes & pred_yes).sum()
    fn = (actual_yes & ~pred_yes).sum()
    tn = (~actual_yes & ~pred_yes).sum()
    return {
        "precision": tp / (tp + fp),  # of predicted on-time, how many were?
        "recall": tp / (tp + fn),     # of truly on-time, how many did we catch?
        "accuracy": (tp + tn) / len(y_true),
    }
# Naive model: always predicts the training-set mean (26.3 min)
naive_pred = np.full_like(y_test.values, y_train.mean(), dtype=float)
# Stacking Ensemble SLA accuracy: 93.5% | precision: 94.4% | recall: 96.3%
# Naive baseline SLA accuracy: 70.2% — >23pp gap attributable to the model
Visualizations
Four multi-panel charts walk through each stage of the analysis — from raw data exploration through model comparison, feature importance, and SLA business impact. Each figure is built with plt.subplots() on a dark background to match the portfolio’s visual language.
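As one concrete example, the model-comparison panel boils down to a horizontal bar chart of test RMSE per model. A minimal sketch using the scores reported above (styling simplified; the `Agg` backend and output filename are assumptions for a headless run):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; unnecessary inside a notebook
import matplotlib.pyplot as plt

scores = {  # test RMSE in minutes, from the comparison above
    "Ridge": 6.21, "Random Forest": 3.94, "XGBoost": 3.98,
    "HistGBRT": 3.98, "LightGBM": 3.98, "CatBoost": 3.97, "Stacking": 3.90,
}
fig, ax = plt.subplots(figsize=(9, 4.5))
bars = ax.barh(list(scores), list(scores.values()), color="#4c72b0")
bars[-1].set_color("#dd8452")  # highlight the stacking ensemble
ax.set_xlabel("Test RMSE (minutes)")
ax.set_title("Model comparison (lower is better)")
ax.invert_yaxis()  # first entry on top
fig.tight_layout()
fig.savefig("model_comparison.png", dpi=150)
```

The other panels follow the same pattern: one `plt.subplots()` grid per figure, with each axis populated from the results dictionaries built earlier in the pipeline.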
distance_km consistently ranks in the top 4 across all three algorithms, validating the haversine engineering step.
Conclusion
The biggest lesson from this project is that raw data rarely contains the right representation. The dataset ships with GPS coordinates — but latitude and longitude alone carry almost no predictive signal when treated as plain numerics. Converting them to a haversine distance produced a top-4 feature across all three importance charts.
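This effect is easy to reproduce on synthetic data. The sketch below (not the project dataset; the coordinate box, the 1.4 min/km slope, and the noise level are assumptions) generates random restaurant/customer coordinates, makes the target proportional to great-circle distance, and compares a linear model on raw coordinates against one on the derived distance:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def haversine_km(lat1, lon1, lat2, lon2):
    R = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
    return R * 2 * np.arcsin(np.sqrt(a))

rng = np.random.default_rng(42)
# 2000 orders in a ~20 km box: (rest_lat, rest_lon, cust_lat, cust_lon)
coords = rng.uniform([12.9, 77.5, 12.9, 77.5],
                     [13.1, 77.7, 13.1, 77.7], size=(2000, 4))
dist = haversine_km(*coords.T)
y = 1.4 * dist + rng.normal(0, 1.0, len(dist))  # minutes: slope plus noise

r2_raw = r2_score(y, LinearRegression().fit(coords, y).predict(coords))
r2_dist = r2_score(y, LinearRegression().fit(dist[:, None], y)
                        .predict(dist[:, None]))
```

In runs of this sketch, the raw-coordinate model explains little of the variance (distance is a symmetric, nonlinear function of the four coordinates), while the single derived distance feature captures nearly all of it.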
The second lesson is about model ceilings. When XGBoost, HistGBRT, LightGBM, and CatBoost all converge to within 0.001 R² of each other, adding another base model will not help — but a stacking ensemble that blends their out-of-fold predictions can still extract additional signal by exploiting different error patterns among the learners.
The SLA framing also changes the conversation with stakeholders. A standalone RMSE of 3.90 minutes is hard to contextualize; “93.5% of our 30-minute SLA promises are correct — up 23 points over guessing the average” is a number an operations team can build a customer-communication strategy around.