Thyroid Cancer Recurrence

This dataset focuses on thyroid cancer recurrence after Radioactive Iodine (RAI) therapy. It contains 383 patient records with 13 key attributes. This data will be used to build predictive models for future cancer recurrence using logistic regression, random forest, and XGBoost.

Column Descriptions:

Age : Age of the patient (in years).
Gender : Patient’s gender (Male or Female).
Hx Radiotherapy : History of prior radiotherapy (Yes or No).
Adenopathy : Presence of lymph node involvement (Yes or No).
Pathology : Type of thyroid cancer (e.g., Micropapillary).
Focality : Tumor focality (Uni-Focal or Multi-Focal).
Risk : Cancer risk classification (Low, Intermediate, High).
T : Tumor classification (T1, T2, etc.).
N : Lymph node classification (N0, N1, etc.).
M : Metastasis classification (M0, M1, etc.).
Stage : Cancer staging (Stage I, II, III, IV).
Response : Treatment response (Excellent, Indeterminate, etc.).
Recurred : Whether cancer recurred (Yes or No).

What is the average age of a patient with thyroid cancer?

mean_age <- thyroid_data |> 
  summarize(mean_age = mean(Age)) |>
  unlist() 

min(thyroid_data$Age)

[1] 15

max(thyroid_data$Age)

[1] 82

ggplot(thyroid_data, aes(x = Age)) +
  geom_histogram(bins = 20, fill = "lightblue", color= "black") +
  scale_x_continuous(breaks = seq(10, 90, by = 10)) +
  labs(title = "Age Histogram", x = "Age", y = "Number of Patients") +
  geom_vline(xintercept = mean_age, color = "red", linetype = "dashed") +
  annotate("text", x = 50, y = 45,
           label = glue("Average Age {round(mean_age, 2)}"), 
           color = "red")

The histogram shows the age distribution of the patients.

gender_counts <- thyroid_data |>
  count(Gender, name = "total")

total_pt <- nrow(thyroid_data)

glue("The percentage of female patients: {round(gender_counts[1,2]/total_pt * 100, 2)}%\n The percent of male patients: {round(gender_counts[2,2]/total_pt * 100, 2)}%")

The percentage of female patients: 81.46%
The percent of male patients: 18.54%

What are the recurrence rates based on gender?

recurred_sum <- thyroid_data |>
  summarize(recurred_sum = sum(as.numeric(Recurred)-1), .by = Gender)

recurred_graph <- right_join(recurred_sum, gender_counts, by = "Gender") |>
  mutate(no_recurred = total - recurred_sum)

recurred_graph1 <- recurred_graph |>
  pivot_longer(cols = c(recurred_sum, no_recurred),
               names_to = "Status", values_to = "Count")

ggplot(recurred_graph1, aes(x = Gender, y = Count, fill = Status)) +
  geom_col() +
  scale_fill_manual(
    values = c("recurred_sum" = "red3", "no_recurred" = "lightblue3"),
    labels = c("Recurred", "Did Not Recur"))

About 81% of patients are female and 19% are male. The bar chart reveals that male patients have a higher recurrence rate than females, suggesting gender might be an important predictor.

Logistic regression model

thyroid_data |>
  count(Recurred)

  Recurred   n
1       No 275
2      Yes 108

#split data 
set.seed(2)
split <- initial_split(thyroid_data, prop = 0.8, strata = Recurred)
train <- training(split)
test <- testing(split)

ggplot(train, aes(x = Age)) +
  geom_boxplot(aes(y = "Train")) +
  geom_boxplot(data = test, aes(y = "Test"))

vfold <- vfold_cv(train, strata = Recurred, v = 10)

The data is split into 80% for training and 20% for testing. Since recurrence happens in about 40% of cases, the split is stratified on the Recurrence feature to preserve class balance. The random split looks balanced between the test and training set when checked on the Age feature.

recipe_recurred <- recipe(Recurred ~., data = thyroid_data) |>
  step_normalize(Age) |>
  step_dummy(all_nominal_predictors())
  
logistic_model <- logistic_reg(penalty = 0.1, mixture = 1) |>
  set_engine("glm") |>
  set_mode("classification") 

logistic_wf <- workflow() |>
  add_model(logistic_model) |>
  add_recipe(recipe_recurred)

logistic_fit <- fit(logistic_wf, data = train)

tidy(logistic_fit) |>
  filter(p.value <= 0.05) |> print()

# A tibble: 3 × 5
  term                   estimate std.error statistic p.value
  <chr>                     <dbl>     <dbl>     <dbl>   <dbl>
1 Gender_M                   2.41      1.21      2.00 0.0455 
2 Response_Excellent        -6.90      2.38     -2.90 0.00377
3 Response_Indeterminate    -4.55      1.68     -2.71 0.00670

exp(c(2.4, -6.9, -4.55))

[1] 11.023176381  0.001007785  0.010567204

Male gender and treatment response levels are statistically significant predictors of thyroid cancer recurrence. Male patients have approximately 11 times higher odds of recurrence compared to female patients, holding all other variables constant. Additionally, patients with excellent and intermediate responses to treatment have a less than 1% chance of cancer recurrence.

Random Forest Model

rf_model <- rand_forest() |>
  set_mode("classification") |>
  set_engine("ranger", 
            importance = "impurity")

rf_workflow <- workflow() |>
  add_recipe(recipe_recurred) |>
  add_model(rf_model)

rf_fit <- fit(rf_workflow, data = train)
vip(rf_fit)

This plot shows which features most strongly influence predictions in the random forest model.

XGBoost Model

xgb_model <- boost_tree() |>
  set_mode("classification") |>
  set_engine("xgboost")

xgb_workflow <- workflow() |>
  add_model(xgb_model) |>
  add_recipe(recipe_recurred)

xgb_fit <- fit(xgb_workflow, data = train)
vip(xgb_fit)

This plot shows which features most strongly influence predictions in the xgboost model.

Validation

The validation process employs 10-fold cross-validation to assess model performances. This approach splits the training data into 10 subsets, trains the model on 9 fold, and evaluated on the last remaining fold. The process is repeated 10 times, ensuring that each data point is used for both training and validation. Cross training helps mitigate overfitting.

workflow_names <- c("logit", "rf", "xgboost")
workflow_variables <- list(logistic_wf, rf_workflow, xgb_workflow)
class_metric <- metric_set(f_meas, brier_class, roc_auc, mcc)

workflows_vfold <- tibble(workflow_name = workflow_names,
                    workflow_object = workflow_variables) |>
  rowwise()|>
  mutate(fits = list(fit_resamples(workflow_object,
                                   vfold,
                                   metrics = class_metric))) |>
  mutate(metrics = list(collect_metrics(fits)))

Predictions

predictions <- tibble(workflow_name = workflow_names,
                    workflow_fit = list(logistic_fit, rf_fit, xgb_fit)) |>
  rowwise() |>
  mutate(pred_class = list(predict(workflow_fit,
                                   test,
                                   type = "class"))) |>
  mutate(pred_prob = list(predict(workflow_fit,
                                  test,
                                  type = "prob")))
predictions <- predictions |>
  mutate(predictions = list(bind_cols(pred_class, pred_prob))) |>
  select(-c(pred_class, pred_prob))

predictions  <- predictions |>
  select(workflow_name, predictions) |>
  unnest(cols = c(predictions)) |>
  cbind(Recurred = factor(test$Recurred))

for (model in workflow_names){
    cat("\nConfusion Matrix for", model, ":\n")
    predictions |>
    filter(workflow_name == model) |>
    conf_mat(truth = Recurred, estimate = .pred_class) |>
    print()
}


Confusion Matrix for logit :
          Truth
Prediction No Yes
       No  55   3
       Yes  0  19

Confusion Matrix for rf :
          Truth
Prediction No Yes
       No  54   0
       Yes  1  22

Confusion Matrix for xgboost :
          Truth
Prediction No Yes
       No  54   0
       Yes  1  22

predictions |>
  group_by(workflow_name)  |> 
  mcc(truth = Recurred, estimate =.pred_class)

# A tibble: 3 × 4
  workflow_name .metric .estimator .estimate
  <chr>         <chr>   <chr>          <dbl>
1 logit         mcc     binary         0.905
2 rf            mcc     binary         0.969
3 xgboost       mcc     binary         0.969

Model Assessment

The models are evaluated using the following metrics:

F1 Score: Balances precision and recall. Better for evaluating imbalanced datasets compared to accuracy. A high score suggests effective prediction.
Brier Score: Measures the accuracy of probabilistic predictions. Lower values are better.
Matthews Correlation Coefficient: Useful when classes are imbalanced.
ROC AUC: Evaluates models ability to distinguish between classes. Higest value indicate better distinction.

workflows_vfold |>
  select(workflow_name,
           metrics) |>
  unnest(metrics) |>
  ggplot(aes(y = workflow_name,
             x = mean)) +
  geom_col(aes(fill = workflow_name)) +
  facet_wrap(~.metric,
             nrow = 3,
             scales = "free_x") +
  theme(legend.position = "none")

This plot helps identify which model performed best on each metric.

Conclusion

This analysis explored the prediction of thyroid cancer recurrence following Radioactive Iodine (RAI) therapy using logistic regression, random forest, and XGBoost. After evaluating model performance with cross-validation and multiple metrics (F1 score, ROC AUC, Brier score, and Matthews Correlation Coefficient), all three models demonstrated reasonable classification ability, with differences in performance depending on the metric used.

Logistic regression revealed that male gender and treatment response were statistically significant predictors of recurrence. Specifically, male patients had approximately 11 times higher odds of experiencing cancer recurrence compared to female patients, when controlling for other variables. Patients who showed an Excellent or Indeterminate response to treatment had substantially lower odds of recurrence, less than 1% of the odds compared to the reference group.

Among the models, random forest showed the strongest predictive performance. However, model selection should be informed by the context in which predictions are used. For example, in clinical settings, minimizing false negatives may be more critical than overall accuracy.

Limitations of this study include a moderate sample size (n = 383) and some class imbalance between recurrence outcomes. Future work could benefit from external validation using a separate dataset, incorporating additional clinical variables, and potentially using ensemble methods or calibration techniques to refine probability estimates.

In summary, predictive modeling can provide valuable support in identifying patients at higher risk of recurrence, aiding clinicians in tailoring follow-up strategies and treatment plans more effectively.