Core framework: rsample for stratified splits and CV folds, recipes for the preprocessing pipeline, parsnip for the model interface, workflows to bind recipe + model, tune for grid search, and yardstick for evaluation metrics.
Business Problem
Customer churn is one of the most costly challenges in subscription-based industries. In telecom, acquiring a new customer costs 5–25x more than retaining an existing one. A telecom company with 7,000 customers and a 26% annual churn rate is quietly losing a significant share of its recurring revenue each year.
The goal of this project is to build a model that identifies customers who are likely to churn before they leave, so the retention team can intervene with targeted offers. Every modeling decision — from metric selection to hyperparameter tuning — is framed around reducing missed churners (false negatives) while keeping false alarms (false positives) low enough to avoid wasting retention budget.
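As a back-of-envelope illustration of the stakes (the $65 average monthly charge is an assumption for illustration; it is not a figure from the dataset):

```r
# Hypothetical revenue-at-risk estimate. Customer count and churn rate
# come from the text above; the $65 average monthly charge is assumed.
customers           <- 7000
churn_rate          <- 0.26
avg_monthly_revenue <- 65

churners_per_year      <- customers * churn_rate                    # ~1,820 customers
annual_revenue_at_risk <- churners_per_year * avg_monthly_revenue * 12
annual_revenue_at_risk                                              # ~$1.42M per year
```

Even a modest improvement in retention against that baseline pays for a substantial modeling effort.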
Pipeline Architecture & Tools
The pipeline follows the idiomatic tidymodels workflow: define a recipe, specify a model, assemble a workflow, resample with cross-validation, tune, and finalize. This structure makes every step independently testable and composable.
Gradient boosted trees with 7 tunable hyperparameters: trees, tree_depth, min_n, loss_reduction, sample_size, mtry, and learn_rate. Engine bound via parsnip for a clean, framework-agnostic interface.
30-candidate grid_latin_hypercube() design provides better coverage of the 7-dimensional hyperparameter space than a regular grid of the same size — especially important when some dimensions (e.g., learn_rate on log scale) span orders of magnitude.
Variable importance plots from the final fitted model, connecting feature signals back to business context: tenure, contract type, and monthly charges dominate the churn signal.
The Code
library(tidyverse) # readr, dplyr, ggplot2, tidyr, purrr, stringr, forcats
library(tidymodels) # rsample, recipes, parsnip, workflows, tune, yardstick
library(xgboost) # XGBoost engine
library(vip) # Variable importance plots
library(finetune) # Optional: racing methods for faster tuning
tidymodels_prefer() # Suppress conflicts with base R (e.g. yardstick::rmse)
set.seed(42) # Global reproducibility seed
# A shared cleaning helper applied to both train and test.
# TotalCharges is stored as character in the raw CSV because brand-new
# customers (tenure == 0) have a blank string instead of 0 — parse_number()
# silently converts those to NA, which is then imputed downstream.
clean_telco <- function(df) {
df |>
mutate(
TotalCharges = parse_number(as.character(TotalCharges)),
SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1),
labels = c("No", "Yes"))
)
}
train_raw <- raw_train |>
clean_telco() |>
mutate(
# Target: must be a factor; positive class = "Yes" (churned)
Churn = factor(Churn, levels = c("Yes", "No"))
)
# Quick class balance check
train_raw |> count(Churn) |> mutate(pct = n / sum(n) * 100)
# Churn n pct
# Yes 1869 26.5%
# No 5174 73.5%
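The recipe, CV folds, and last_fit() below refer to df_train, df_val, and data_split. The split itself is not shown above; a stratified hold-out along these lines would produce those objects (prop = 0.8 is an assumed choice):

```r
# Stratified train/validation split; strata = Churn preserves the
# ~26/74 class balance in both partitions
data_split <- initial_split(train_raw, prop = 0.8, strata = Churn)
df_train   <- training(data_split)
df_val     <- testing(data_split)
```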
churn_recipe <- recipe(Churn ~ ., data = df_train) |>
# Demote the id column to an "ID" role — kept for bookkeeping,
# excluded from the predictor set (update_role() does not drop it)
update_role(id, new_role = "ID") |>
# Imputation: median for numerics, mode for nominals
step_impute_median(all_numeric_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
# Feature engineering: customers with no internet have many "No internet
# service" sub-features. Flag this explicitly so the model can distinguish
# "I don't want this add-on" vs "I have no internet at all".
step_mutate(
no_internet = if_else(InternetService == "No", 1L, 0L)
) |>
# Dummy encode all nominal predictors (one_hot = FALSE drops the
# reference level, i.e. C - 1 indicator columns per factor)
step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
# Center and scale all numeric predictors
step_normalize(all_numeric_predictors()) |>
# Remove near-zero-variance columns (can appear after dummy encoding)
step_nzv(all_predictors())
# All 7 key XGBoost hyperparameters are marked tune() for grid search.
xgb_spec <- boost_tree(
trees = tune(), # number of boosting rounds
tree_depth = tune(), # max depth per tree
min_n = tune(), # minimum observations per leaf
loss_reduction = tune(), # gamma — minimum loss reduction to split
sample_size = tune(), # row subsampling fraction
mtry = tune(), # number of predictors sampled per split
learn_rate = tune() # eta — step shrinkage
) |>
set_engine("xgboost", nthread = parallel::detectCores() - 1) |>
set_mode("classification")
# A workflow bundles recipe + model into a single object that can be
# tuned, evaluated, and finalized with a consistent API.
churn_workflow <- workflow() |>
add_recipe(churn_recipe) |>
add_model(xgb_spec)
cv_folds <- vfold_cv(df_train, v = 5, strata = Churn)
xgb_grid <- grid_latin_hypercube(
trees(range = c(200L, 1000L)),
tree_depth(range = c(3L, 8L)),
min_n(range = c(5L, 30L)),
loss_reduction(), # gamma; dials uses a log10 scale by default
sample_size = sample_prop(range = c(0.6, 1.0)),
finalize(mtry(), df_train |> select(-id, -Churn)), # upper bound = # predictors
learn_rate(range = c(-3, -1)), # 0.001 to 0.1; learn_rate() is already log10-scaled
size = 30 # 30 candidate combinations
)
tuning_results <- tune_grid(
churn_workflow,
resamples = cv_folds,
grid = xgb_grid,
metrics = metric_set(roc_auc, accuracy, f_meas, pr_auc),
control = control_grid(save_pred = TRUE, verbose = TRUE)
)
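The finetune package loaded at the top is marked optional; as a sketch, a racing method can stand in for tune_grid() with the same arguments, dropping clearly inferior candidates after the first few folds instead of evaluating all 30 on every fold:

```r
# Drop-in replacement for tune_grid(); control_race() keeps the
# out-of-fold predictions and reports eliminated candidates
tuning_results <- tune_race_anova(
  churn_workflow,
  resamples = cv_folds,
  grid = xgb_grid,
  metrics = metric_set(roc_auc, accuracy, f_meas, pr_auc),
  control = control_race(save_pred = TRUE, verbose_elim = TRUE)
)
```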
# Select and finalize the best configuration
best_params <- select_best(tuning_results, metric = "roc_auc")
final_workflow <- finalize_workflow(churn_workflow, best_params)
# last_fit() trains on df_train and evaluates on held-out df_val
final_fit <- last_fit(final_workflow, split = data_split,
metrics = metric_set(roc_auc, accuracy, f_meas, pr_auc))
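Before moving to the test set, the held-out metrics and predictions from last_fit() can be inspected directly; a small sketch:

```r
# Held-out validation metrics for the finalized model
collect_metrics(final_fit)

# Per-observation predictions, e.g. for a confusion matrix
# at the default 0.5 probability threshold
collect_predictions(final_fit) |>
  conf_mat(truth = Churn, estimate = .pred_class)
```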
fitted_workflow <- extract_workflow(final_fit)
# Apply the same cleaning helper to the raw test set, then generate
# class probabilities on the unseen data
test_clean <- raw_test |> clean_telco()
test_predictions <- predict(fitted_workflow,
new_data = test_clean,
type = "prob")
# Submission format: id | Churn (P(Churn = "Yes"))
submission <- test_clean |>
select(id) |>
bind_cols(test_predictions) |>
rename(Churn = .pred_Yes) |>
select(id, Churn)
# Sanity checks before writing
stopifnot(
nrow(submission) == nrow(test_clean),
all(between(submission$Churn, 0, 1)),
!anyNA(submission)
)
write_csv(submission, SUBMISSION_PATH)
cat("Submission written to:", SUBMISSION_PATH, "\n")
Visualizations
Four figures walk through each analytical stage — raw data exploration, model performance, feature importance, and financial impact. Every chart uses the same dark-theme palette as this site.
Conclusion & Key Insights
The pipeline achieves a ROC AUC above 0.84 with consistent 5-fold cross-validation scores, confirming that the model generalizes well beyond the training set. More importantly, it surfaces a clear business narrative: the highest-risk customers are those on month-to-month contracts with fiber-optic plans; they churn at over 40% annually and tend to have the highest monthly charges.
The feature importance analysis gives the retention team direct action items. Tenure is the strongest predictor — short-tenure customers who are still on high-cost service bundles should be the first target for proactive outreach. Offering these customers a contract upgrade or a discounted annual plan would cut churn risk at the most impactful moment: before they become habitual month-to-month subscribers.
The business impact calculation shows that catching even 60–70% of churners before they leave — at a $50 retention offer per customer — produces a meaningful positive net benefit versus the alternative of letting them cancel and then paying a far higher re-acquisition cost.
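The net-benefit arithmetic above can be made concrete. The $50 offer and the 60–70% catch rate come from the text; the 40% offer acceptance rate and $400 re-acquisition cost below are assumptions for illustration only:

```r
# Illustrative only: offer cost and catch rate from the text;
# acceptance rate and re-acquisition cost are assumed figures
churners <- 1820                      # 26% of 7,000 customers
caught   <- churners * 0.65           # mid-range of the 60-70% quoted
saved    <- caught * 0.40             # accept the offer and stay
offer_cost            <- caught * 50  # $50 offer to every flagged churner
avoided_reacquisition <- saved * 400  # cost to replace each lost customer

net_benefit <- avoided_reacquisition - offer_cost
net_benefit                           # positive under these assumptions
```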