Machine Learning Case Study — Kaggle Competition

Predicting Telecom Customer Churn with tidymodels & XGBoost

A full end-to-end supervised learning pipeline built in R using the tidymodels ecosystem. The project covers exploratory analysis, feature engineering, a tunable XGBoost model with 5-fold cross-validation, hyperparameter search over 30 Latin-hypercube candidates, and a business-impact analysis that quantifies retained revenue versus retention spend.

Tags: R, tidymodels, XGBoost, tidyverse, vip, Kaggle, ROC AUC

Business Problem

Customer churn is one of the most costly challenges in subscription-based industries. In telecom, acquiring a new customer costs 5–25x more than retaining an existing one. A telecom company with 7,000 customers and a 26% annual churn rate is quietly losing a significant share of its recurring revenue each year.
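To put that churn rate in rough dollar terms, a back-of-the-envelope sketch (the average monthly charge below is an assumed figure, in the ballpark of the Telco dataset's mean, not a number reported by this project):

```r
# Hypothetical revenue-at-risk estimate; avg_monthly_charge is assumed
customers          <- 7000
churn_rate         <- 0.26
avg_monthly_charge <- 65     # assumption, roughly the dataset average

churned                <- customers * churn_rate           # 1820 customers
annual_revenue_at_risk <- churned * avg_monthly_charge * 12
annual_revenue_at_risk                                     # ~$1.42M per year
```

Even with conservative assumptions, the annual revenue at risk runs well into seven figures, which is what makes a modest per-customer retention spend worth modeling carefully.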

The goal of this project is to build a model that identifies customers who are likely to churn before they leave, so the retention team can intervene with targeted offers. Every modeling decision — from metric selection to hyperparameter tuning — is framed around reducing missed churners (false negatives) while keeping false alarms (false positives) low enough to avoid wasting retention budget.
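One way to make that false-negative/false-positive trade-off concrete: given a cost for each error type, the expected-cost-minimizing decision threshold is cost_fp / (cost_fp + cost_fn). The costs below are illustrative assumptions, not figures from the project:

```r
# Illustrative cost-sensitive threshold (assumed costs, not project figures)
cost_fn <- 500   # assumed: annual revenue lost from a missed churner
cost_fp <- 50    # assumed: retention offer wasted on a loyal customer

# Flag "will churn" whenever P(churn) exceeds this threshold
threshold <- cost_fp / (cost_fp + cost_fn)
threshold   # ~0.09 — far below the default 0.5, reflecting costly misses
```

When missing a churner costs ten times as much as a wasted offer, the optimal threshold drops well below 0.5, which is exactly why this project optimizes for catching churners rather than raw accuracy.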

Dataset: IBM Telco Customer Churn (Kaggle competition format). 7,043 customers, 20 features covering demographics, service subscriptions, contract type, billing method, and monthly/total charges.

Pipeline Architecture & Tools

The pipeline follows the idiomatic tidymodels workflow: define a recipe, specify a model, assemble a workflow, resample with cross-validation, tune, and finalize. This structure makes every step independently testable and composable.

tidymodels

Core framework: rsample for stratified splits and CV folds, recipes for the preprocessing pipeline, parsnip for the model interface, workflows to bind recipe + model, tune for grid search, and yardstick for evaluation metrics.

XGBoost

Gradient boosted trees with 7 tunable hyperparameters: trees, tree_depth, min_n, loss_reduction, sample_size, mtry, and learn_rate. Engine bound via parsnip for a clean, framework-agnostic interface.

Latin Hypercube Search

30-candidate grid_latin_hypercube() design provides better coverage of the 7-dimensional hyperparameter space than a regular grid of the same size — especially important when some dimensions (e.g., learn_rate on log scale) span orders of magnitude.
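The idea behind a Latin hypercube is simple to sketch in base R: each dimension is divided into n equal bins, and each bin is hit exactly once, in a random order per dimension. This is a minimal illustration of the concept, not the dials implementation:

```r
# Minimal Latin-hypercube sampler (concept sketch, not grid_latin_hypercube())
set.seed(42)
lhs <- function(n, d) {
  # Each column: one point per equal-width bin of [0, 1), bins in random order
  sapply(seq_len(d), function(j) (sample(n) - runif(n)) / n)
}

grid <- lhs(30, 7)                 # 30 candidates across 7 hyperparameters
dim(grid)                          # 30 x 7
# Map one unit-scale column onto learn_rate's log10 range [1e-3, 1e-1]
learn_rate <- 10 ^ (-3 + 2 * grid[, 7])
range(learn_rate)                  # always within [0.001, 0.1]
```

Because every bin in every dimension is sampled exactly once, 30 points cover each 1-D margin far more evenly than 30 points of a regular 7-D grid could.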

vip

Variable importance plots from the final fitted model, connecting feature signals back to business context: tenure, contract type, and monthly charges dominate the churn signal.

The Code

Setup & library loading (R)
library(tidyverse)    # readr, dplyr, ggplot2, tidyr, purrr, stringr, forcats
library(tidymodels)   # rsample, recipes, parsnip, workflows, tune, yardstick
library(xgboost)      # XGBoost engine
library(vip)          # Variable importance plots
library(finetune)     # Optional: racing methods for faster tuning

tidymodels_prefer()   # Resolve namespace conflicts in favor of tidymodels
set.seed(42)          # Global reproducibility seed
Data loading & type coercion (R)
# A shared cleaning helper applied to both train and test.
# TotalCharges is stored as character in the raw CSV because brand-new
# customers (tenure == 0) have a blank string instead of 0 — parse_number()
# silently converts those to NA, which is then imputed downstream.
clean_telco <- function(df) {
  df |>
    mutate(
      TotalCharges  = parse_number(as.character(TotalCharges)),
      SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1),
                             labels = c("No", "Yes"))
    )
}

train_raw <- raw_train |>
  clean_telco() |>
  mutate(
    # Target: must be a factor; positive class = "Yes" (churned)
    Churn = factor(Churn, levels = c("Yes", "No"))
  )

# Quick class balance check
train_raw |> count(Churn) |> mutate(pct = n / sum(n) * 100)
# Churn   n    pct
# Yes   1869  26.5%
# No    5174  73.5%
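The blank-string behaviour that motivates the imputation step can be reproduced with base R's as.numeric (readr's parse_number handles the same case, plus currency symbols and grouping separators):

```r
# New customers (tenure == 0) have "" in TotalCharges; coercion yields NA
raw <- c("29.85", "1889.5", "")       # last value: a brand-new customer
suppressWarnings(as.numeric(raw))     # 29.85, 1889.5, NA
# The NA is what step_impute_median() fills in downstream
```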
Preprocessing recipe (R)
churn_recipe <- recipe(Churn ~ ., data = df_train) |>

  # Demote the id column — update_role() keeps it in the data for
  # bookkeeping but excludes it from the predictor set
  update_role(id, new_role = "ID") |>

  # Imputation: median for numerics, mode for nominals
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>

  # Feature engineering: customers with no internet have many "No internet
  # service" sub-features. Flag this explicitly so the model can distinguish
  # "I don't want this add-on" vs "I have no internet at all".
  step_mutate(
    no_internet = if_else(InternetService == "No", 1L, 0L)
  ) |>

  # Dummy encode all nominal predictors (reference-level coding:
  # a k-level factor becomes k-1 indicator columns)
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>

  # Center and scale all numeric predictors (not required for tree
  # models, but harmless and keeps the recipe reusable elsewhere)
  step_normalize(all_numeric_predictors()) |>

  # Remove near-zero-variance columns (can appear after dummy encoding)
  step_nzv(all_predictors())
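What step_dummy(one_hot = FALSE) produces can be previewed with base R's model.matrix, which applies the same reference-level contrast coding (a stand-in illustration — the recipe step names its output columns slightly differently):

```r
# Reference-level dummy coding: a 3-level factor -> 2 indicator columns
contract <- factor(c("Month-to-month", "One year", "Two year", "Month-to-month"))
mm <- model.matrix(~ contract)[, -1]   # drop the intercept column
colnames(mm)   # "contractOne year" "contractTwo year" — first level is baseline
mm             # baseline rows ("Month-to-month") are all zeros
```

Dropping the first level avoids perfectly collinear columns; with one_hot = TRUE the baseline would get its own indicator instead.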
Model specification & workflow (R)
# All 7 key XGBoost hyperparameters are marked tune() for grid search.
xgb_spec <- boost_tree(
  trees          = tune(),   # number of boosting rounds
  tree_depth     = tune(),   # max depth per tree
  min_n          = tune(),   # minimum node size required to keep splitting
  loss_reduction = tune(),   # gamma — minimum loss reduction to split
  sample_size    = tune(),   # row subsampling fraction
  mtry           = tune(),   # predictors sampled per split (a count, not a fraction)
  learn_rate     = tune()    # eta — step shrinkage
) |>
  set_engine("xgboost", nthread = parallel::detectCores() - 1) |>
  set_mode("classification")

# A workflow bundles recipe + model into a single object that can be
# tuned, evaluated, and finalized with a consistent API.
churn_workflow <- workflow() |>
  add_recipe(churn_recipe) |>
  add_model(xgb_spec)
Hyperparameter tuning — Latin hypercube search (R)
cv_folds <- vfold_cv(df_train, v = 5, strata = Churn)

xgb_grid <- grid_latin_hypercube(
  trees(range = c(200L, 1000L)),
  tree_depth(range = c(3L, 8L)),
  min_n(range = c(5L, 30L)),
  loss_reduction(),             # default log10 scale: ~1e-10 to ~32
  sample_size = sample_prop(range = c(0.6, 1.0)),
  finalize(mtry(), df_train),   # resolves upper bound to # predictors at runtime
  learn_rate(range = c(-3, -1)),  # log10 scale by default: 0.001 to 0.1
  size = 30                     # 30 candidate combinations
)

tuning_results <- tune_grid(
  churn_workflow,
  resamples = cv_folds,
  grid      = xgb_grid,
  metrics   = metric_set(roc_auc, accuracy, f_meas, pr_auc),
  control   = control_grid(save_pred = TRUE, verbose = TRUE)
)

# Select and finalize the best configuration
best_params    <- select_best(tuning_results, metric = "roc_auc")
final_workflow <- finalize_workflow(churn_workflow, best_params)

# last_fit() refits on the training portion of data_split and evaluates
# exactly once on the held-out assessment set
final_fit <- last_fit(final_workflow, split = data_split,
                       metrics = metric_set(roc_auc, accuracy, f_meas, pr_auc))
Submission output — churn probabilities (R)
fitted_workflow <- extract_workflow(final_fit)

# Generate class probabilities on the unseen test set
test_predictions <- predict(fitted_workflow,
                             new_data = test_clean,
                             type = "prob")

# Submission format: id | Churn (P(Churn = "Yes"))
submission <- test_clean |>
  select(id) |>
  bind_cols(test_predictions) |>
  rename(Churn = .pred_Yes) |>
  select(id, Churn)

# Sanity checks before writing
stopifnot(
  nrow(submission) == nrow(test_clean),
  all(between(submission$Churn, 0, 1)),
  !anyNA(submission)
)

write_csv(submission, SUBMISSION_PATH)
cat("Submission written to:", SUBMISSION_PATH, "\n")

Visualizations

Four figures walk through each analytical stage — raw data exploration, model performance, feature importance, and financial impact. Every chart uses the same dark-theme palette as this site.

Exploratory Data Analysis. Class imbalance (73.5% retained / 26.5% churned), churn rate by contract type showing month-to-month customers churn at 3× the rate of two-year subscribers, and distribution plots revealing that churners tend to have shorter tenure and higher monthly charges.
Model Performance. ROC curve and Precision-Recall curve for the tuned XGBoost model, confusion matrix on the 20% hold-out set, and grouped 5-fold cross-validated metrics. ROC AUC exceeds 0.84 with consistent performance across folds — a strong baseline for a production retention system.
Feature Importance. Tenure, Monthly Charges, and Total Charges dominate the model's decisions. Contract type and online security are the most predictive categorical features — results that align strongly with domain intuition and give the retention team clear levers to pull.
Business Impact. Revenue analysis translating TP/FN counts into dollars of retained or lost annual revenue, a segment heatmap showing that Fiber Optic + Month-to-Month is the highest-risk combination (>40% churn rate), and a KPI summary linking model metrics directly to retention economics.

Conclusion & Key Insights

The pipeline achieves a ROC AUC above 0.84 with consistent 5-fold cross-validation scores, confirming the model generalizes well beyond the training set. More importantly, it surfaces a clear business narrative: the highest-risk customers are those on month-to-month contracts with Fiber Optic plans — they churn at over 40% annually and tend to have the highest monthly charges.

The feature importance analysis gives the retention team direct action items. Tenure is the strongest predictor — short-tenure customers who are still on high-cost service bundles should be the first target for proactive outreach. Offering these customers a contract upgrade or a discounted annual plan would cut churn risk at the most impactful moment: before they become habitual month-to-month subscribers.

The business impact calculation shows that catching even 60–70% of churners before they leave — at a $50 retention offer per customer — produces a meaningful positive net benefit versus the alternative of letting them cancel and then paying a far higher re-acquisition cost.
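That net-benefit claim can be sketched numerically. Only the $50 offer, the 60–70% catch rate, and the 1,869 churner count come from this write-up; the offer success rate and average monthly charge below are labeled assumptions:

```r
# Back-of-the-envelope retention economics (assumed figures marked)
churners        <- 1869    # churned customers in the dataset
catch_rate      <- 0.65    # midpoint of the 60-70% range
offer_cost      <- 50      # per-customer retention offer
save_rate       <- 0.40    # assumed: share of offers that actually retain
monthly_revenue <- 65      # assumed average monthly charge

targeted <- churners * catch_rate
saved    <- targeted * save_rate
benefit  <- saved * monthly_revenue * 12   # retained annual revenue
spend    <- targeted * offer_cost          # total offer spend
round(benefit - spend)                     # clearly positive net benefit
```

The sketch ignores offers wasted on false positives, but even halving the assumed save rate leaves the result comfortably positive — the asymmetry between a $50 offer and a year of lost revenue does most of the work.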