Machine Learning Tutorial

This notebook demonstrates a complete machine learning workflow from data preparation to model evaluation.

Objectives

- Understand the machine learning workflow
- Prepare data for machine learning
- Train and evaluate different models
- Interpret model results
- Make predictions on new data

1. Data Preparation

Let's start by importing libraries and creating a dataset for our machine learning task.

In [ ]:

# Import essential libraries
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

warnings.filterwarnings("ignore")

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print("Scikit-learn available: Yes")

In [ ]:

# Create a synthetic dataset for classification
# This simulates a customer churn prediction problem
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42,
)

# Create feature names
feature_names = [
    "age",
    "income",
    "tenure_months",
    "monthly_charges",
    "total_charges",
    "support_tickets",
    "satisfaction_score",
    "usage_hours",
    "contract_length",
    "payment_method",
]

# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df["churn"] = y  # 1 = customer churned, 0 = customer retained

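# Rescale five features to plausible real-world ranges; the other five
# (total_charges, support_tickets, usage_hours, contract_length, payment_method)
# keep the raw synthetic values produced by make_classification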
df["age"] = np.clip(df["age"] * 10 + 45, 18, 80).astype(int)
df["income"] = np.clip(df["income"] * 15000 + 50000, 20000, 120000).astype(int)
df["tenure_months"] = np.clip(df["tenure_months"] * 12 + 24, 1, 72).astype(int)
df["monthly_charges"] = np.clip(df["monthly_charges"] * 20 + 50, 20, 150).round(2)
df["satisfaction_score"] = np.clip(df["satisfaction_score"] * 2 + 3, 1, 5).round(1)

print(f"Dataset created with {len(df)} customers")
print(f"Features: {len(feature_names)}")
print(f"Churn rate: {df['churn'].mean():.2%}")

2. Exploratory Data Analysis

Let's explore our dataset to understand the characteristics of customers who churn.

In [ ]:

# Basic dataset information
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\nTarget distribution:")
print(df["churn"].value_counts())
print("\nClass balance:")
print(df["churn"].value_counts(normalize=True))

In [ ]:

# Statistical summary
print("Statistical Summary:")
df.describe()

In [ ]:

# Visualize the target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Churn distribution (sorted by label so that class 0 is always plotted first)
df["churn"].value_counts().sort_index().plot(
    kind="bar", ax=axes[0], color=["lightblue", "lightcoral"]
)
axes[0].set_title("Customer Churn Distribution")
axes[0].set_xlabel("Churn (0=No, 1=Yes)")
axes[0].set_ylabel("Count")
axes[0].tick_params(axis="x", rotation=0)

# Churn percentage (sorted by label so the slices match the labels below)
churn_pct = df["churn"].value_counts(normalize=True).sort_index()
axes[1].pie(
    churn_pct.values,
    labels=["Retained", "Churned"],
    autopct="%1.1f%%",
    colors=["lightblue", "lightcoral"],
)
axes[1].set_title("Customer Churn Percentage")

plt.tight_layout()
plt.show()

In [ ]:

# Feature analysis by churn status
# Select key features for analysis
key_features = [
    "age",
    "income",
    "tenure_months",
    "monthly_charges",
    "satisfaction_score",
]

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, feature in enumerate(key_features):
    # Box plot by churn status
    df.boxplot(column=feature, by="churn", ax=axes[i])
    axes[i].set_title(f'{feature.replace("_", " ").title()} by Churn Status')
    axes[i].set_xlabel("Churn (0=No, 1=Yes)")

# Remove the empty subplot
axes[5].remove()

# Clear the "Boxplot grouped by churn" super-title that pandas adds automatically
fig.suptitle("")

plt.tight_layout()
plt.show()

3. Feature Engineering

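Let's prepare our features for machine learning by separating them from the target, splitting the data into training and test sets, and scaling them. All features are numeric, so no categorical encoding is needed.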

In [ ]:

# Separate features and target
X = df.drop("churn", axis=1)
y = df["churn"]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures: {list(X.columns)}")

In [ ]:

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining set churn rate: {y_train.mean():.2%}")
print(f"Test set churn rate: {y_test.mean():.2%}")
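
The split above passes `stratify=y` so that both subsets keep the same churn rate. The effect is easier to see on imbalanced labels; the following is a minimal sketch on hypothetical data (the arrays `X_demo` and `y_demo` are made up for illustration and are not part of the tutorial's dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 retained (0) and 10 churned (1).
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the 10% churn share in both subsets.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(f"Train churn rate: {y_tr.mean():.2%}")
print(f"Test churn rate:  {y_te.mean():.2%}")
```

Without `stratify`, a small test set drawn from imbalanced data can end up with very few positive examples, or none, which makes test metrics unreliable.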

In [ ]:

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrames for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)

print("Features scaled successfully!")
print(f"\nScaled training features - Mean: {X_train_scaled.mean().mean():.4f}")
print(f"Scaled training features - Std: {X_train_scaled.std().mean():.4f}")
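
When data that was scaled once up front is later passed to cross-validation, each validation fold has already contributed to the scaler's mean and variance. One common way to avoid that, sketched here as an alternative rather than as the method this notebook uses, is to wrap the scaler and the model in a scikit-learn pipeline so the scaler is refit inside every fold. The data below is a stand-in generated with the same `make_classification` settings; `pipe` and `X_demo` are illustrative names:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with the same shape as the tutorial's dataset.
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42,
)

# cross_val_score clones and refits the whole pipeline per fold, so the
# scaler only ever sees that fold's training portion.
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
```

For standard scaling on a dataset of this size the difference is usually small, but the pipeline form also keeps preprocessing and model together for later prediction.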

4. Model Training

Let's train multiple machine learning models and compare their performance.

In [ ]:

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # probability=True enables predict_proba, which the prediction step relies on
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
}

# Train models and store results
model_results = {}
trained_models = {}

print("Training models...\n")

for name, model in models.items():
    print(f"Training {name}...")

    # Train the model
    model.fit(X_train_scaled, y_train)
    trained_models[name] = model

    # Cross-validation score
    cv_scores = cross_val_score(
        model, X_train_scaled, y_train, cv=5, scoring="accuracy"
    )

    # Predictions
    y_pred = model.predict(X_test_scaled)

    # Store results
    model_results[name] = {
        "cv_mean": cv_scores.mean(),
        "cv_std": cv_scores.std(),
        "test_accuracy": accuracy_score(y_test, y_pred),
        "predictions": y_pred,
    }

    print(f"  CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    print(f"  Test Accuracy: {accuracy_score(y_test, y_pred):.4f}\n")

print("All models trained successfully!")
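
The models above are trained with mostly default hyperparameters. `GridSearchCV` is imported at the top of the notebook but not used; a minimal sketch of how it could tune the Random Forest follows. The grid values and the smaller stand-in dataset are hypothetical choices made to keep the example fast, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small stand-in dataset so the search finishes quickly.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=7, n_redundant=2, random_state=42
)

# Hypothetical grid: 2 x 2 = 4 candidates, each scored with 3-fold CV.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

After fitting, `search.best_estimator_` holds a model refit on all of the data passed to `fit` with the best parameter combination.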

5. Model Evaluation

Let's evaluate our models using various metrics and visualizations.

In [ ]:

# Compare model performance
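results_df = pd.DataFrame(model_results).T
# The transposed frame has object dtype because each row also stores a
# predictions array; cast the metric columns to float so that round(),
# idxmax() and plotting treat them as numbers
results_df = results_df[["cv_mean", "cv_std", "test_accuracy"]].astype(float)
results_df.columns = ["CV_Mean", "CV_Std", "Test_Accuracy"]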

print("Model Performance Comparison:")
print(results_df.round(4))

# Visualize model comparison
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
x_pos = np.arange(len(results_df))

bars = ax.bar(
    x_pos,
    results_df["Test_Accuracy"],
    yerr=results_df["CV_Std"],
    alpha=0.7,
    color=["lightblue", "lightgreen", "lightcoral"],
    capsize=5,
)

ax.set_xlabel("Models")
ax.set_ylabel("Accuracy")
ax.set_title("Model Performance Comparison")
ax.set_xticks(x_pos)
ax.set_xticklabels(results_df.index, rotation=45)
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)

# Add value labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    ax.text(
        bar.get_x() + bar.get_width() / 2.0,
        height + 0.01,
        f"{height:.3f}",
        ha="center",
        va="bottom",
    )

plt.tight_layout()
plt.show()

In [ ]:

# Detailed evaluation for the best model
best_model_name = results_df["Test_Accuracy"].idxmax()
best_model = trained_models[best_model_name]
best_predictions = model_results[best_model_name]["predictions"]

print(f"Best Model: {best_model_name}")
print(f"Test Accuracy: {model_results[best_model_name]['test_accuracy']:.4f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, best_predictions))

# Confusion matrix
cm = confusion_matrix(y_test, best_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["No Churn", "Churn"],
    yticklabels=["No Churn", "Churn"],
)
plt.title(f"Confusion Matrix - {best_model_name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Calculate additional metrics
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"\nAdditional Metrics:")
print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"F1-Score: {f1:.4f}")
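
The additional metrics above are all ratios of the four confusion-matrix counts. A small worked example with made-up counts may help check the formulas by hand; the helper `binary_metrics` is a hypothetical name, and returning 0.0 when a denominator is zero is a convention chosen for this sketch:

```python
def binary_metrics(tn, fp, fn, tp):
    """Return precision, recall, specificity and F1 from confusion-matrix counts."""

    def safe_div(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)      # of predicted churners, how many churned
    recall = safe_div(tp, tp + fn)         # of actual churners, how many were found
    specificity = safe_div(tn, tn + fp)    # of retained customers, how many were cleared
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts: 50 true negatives, 10 false positives,
# 5 false negatives, 35 true positives.
metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 35/45, recall is 35/40, specificity is 50/60, and F1 simplifies to 2·35 / (2·35 + 10 + 5) = 70/85.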

In [ ]:

# Feature importance (for Random Forest)
if best_model_name == "Random Forest":
    feature_importance = pd.DataFrame(
        {"feature": X.columns, "importance": best_model.feature_importances_}
    ).sort_values("importance", ascending=False)

    plt.figure(figsize=(10, 6))
    sns.barplot(data=feature_importance, x="importance", y="feature", palette="viridis")
    plt.title("Feature Importance - Random Forest")
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()

    print("Top 5 Most Important Features:")
    print(feature_importance.head())
else:
    print(f"Feature importance not available for {best_model_name}")
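
The cell above only reports importances when the Random Forest wins, because `feature_importances_` is specific to tree-based models. A model-agnostic alternative is scikit-learn's `permutation_importance`, which measures how much the score drops when one feature's values are shuffled. The sketch below applies it to an SVM on stand-in data; the dataset, the generic `feature_<i>` labels, and the choice of `n_repeats=5` are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: 3 informative features and 3 pure-noise features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

model = SVC(kernel="rbf", random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on held-out data and record the accuracy drop.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=42)

# Rank features from largest to smallest mean accuracy drop.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature_{idx}: {mean:.4f} +/- {std:.4f}")
```

Computing the importances on held-out data, as here, reflects what the model relies on for generalization rather than what it memorized during training.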

6. Predictions

Let's use our best model to make predictions on new data.

In [ ]:

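# Create new customer data for prediction
# The five rescaled features use realistic units; the other five stay in the
# raw synthetic units that make_classification produced for the training data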
new_customers = pd.DataFrame(
    {
        "age": [25, 45, 65],
        "income": [35000, 75000, 90000],
        "tenure_months": [6, 24, 48],
        "monthly_charges": [75.50, 55.20, 89.90],
        "total_charges": [0.5, 1.2, -0.3],
        "support_tickets": [2.1, 0.1, -1.5],
        "satisfaction_score": [2.5, 4.2, 3.8],
        "usage_hours": [1.2, -0.5, 0.8],
        "contract_length": [-0.8, 0.2, 1.1],
        "payment_method": [0.3, -0.9, 0.6],
    }
)

print("New customers to predict:")
print(new_customers)

# Scale the new data using the same scaler (kept as a DataFrame so the
# feature names match those seen during training)
new_customers_scaled = pd.DataFrame(
    scaler.transform(new_customers), columns=new_customers.columns
)

# Make predictions
predictions = best_model.predict(new_customers_scaled)
prediction_proba = best_model.predict_proba(new_customers_scaled)

# Create results DataFrame
prediction_results = new_customers.copy()
prediction_results["predicted_churn"] = predictions
prediction_results["churn_probability"] = prediction_proba[:, 1]
# include_lowest=True puts a probability of exactly 0 in the "Low" band
prediction_results["risk_level"] = pd.cut(
    prediction_proba[:, 1],
    bins=[0, 0.3, 0.7, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

print("\nPrediction Results:")
display_cols = [
    "age",
    "income",
    "tenure_months",
    "satisfaction_score",
    "predicted_churn",
    "churn_probability",
    "risk_level",
]
print(prediction_results[display_cols])
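
The risk levels above come from `pd.cut`, whose bins are open on the left and closed on the right by default. A short sketch with hypothetical probabilities shows where the boundary values land, and what changes when `include_lowest=True` is passed:

```python
import pandas as pd

# Hypothetical probabilities chosen to sit on and around the bin edges.
probs = pd.Series([0.0, 0.3, 0.31, 0.7, 1.0])
bins = [0, 0.3, 0.7, 1.0]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 0.3], (0.3, 0.7], (0.7, 1.0], so 0.0 falls outside.
default_bands = pd.cut(probs, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
closed_bands = pd.cut(probs, bins=bins, labels=labels, include_lowest=True)

print(
    pd.DataFrame(
        {
            "probability": probs,
            "default": default_bands,
            "include_lowest": closed_bands,
        }
    )
)
```

This matters for models such as a Random Forest, which can return a churn probability of exactly 0 when no tree votes for the positive class.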

In [ ]:

# Visualize prediction results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Churn probability by customer
customer_ids = [f"Customer {i+1}" for i in range(len(new_customers))]
colors = ["green" if p < 0.5 else "red" for p in prediction_proba[:, 1]]

bars = axes[0].bar(customer_ids, prediction_proba[:, 1], color=colors, alpha=0.7)
axes[0].set_title("Churn Probability by Customer")
axes[0].set_xlabel("Customer")
axes[0].set_ylabel("Churn Probability")
axes[0].set_ylim(0, 1)
axes[0].axhline(
    y=0.5, color="black", linestyle="--", alpha=0.7, label="Decision Threshold"
)
axes[0].legend()

# Add probability labels
for i, bar in enumerate(bars):
    height = bar.get_height()
    axes[0].text(
        bar.get_x() + bar.get_width() / 2.0,
        height + 0.02,
        f"{height:.2f}",
        ha="center",
        va="bottom",
    )

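# Risk level distribution (keep the Low/Medium/High order, drop empty bands,
# and pick each slice's color by its label)
risk_colors = {"Low": "lightgreen", "Medium": "orange", "High": "lightcoral"}
risk_counts = (
    prediction_results["risk_level"].value_counts().reindex(["Low", "Medium", "High"])
)
risk_counts = risk_counts[risk_counts > 0]
axes[1].pie(
    risk_counts.values,
    labels=risk_counts.index,
    autopct="%1.0f%%",
    colors=[risk_colors[level] for level in risk_counts.index],
)
axes[1].set_title("Risk Level Distribution")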

plt.tight_layout()
plt.show()
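
The dashed line in the plot marks a decision threshold of 0.5, but the threshold is a choice rather than a fixed property of the model. Below is a minimal sketch of applying a different threshold to predicted probabilities; the probabilities and the helper `classify` are hypothetical, and the helper treats a probability equal to the threshold as churn, matching the coloring rule used in the plot:

```python
import numpy as np

# Hypothetical churn probabilities for three customers.
churn_proba = np.array([0.20, 0.45, 0.60])


def classify(proba, threshold=0.5):
    """Label a customer as churn (1) when the probability reaches the threshold."""
    return (proba >= threshold).astype(int)


print("threshold 0.5:", classify(churn_proba))
print("threshold 0.4:", classify(churn_proba, threshold=0.4))
```

Lowering the threshold flags more customers as at risk, which raises recall at the cost of precision; the right trade-off depends on the relative cost of a missed churner versus an unnecessary retention offer.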

7. Conclusions

Model Performance Summary

We trained and compared three machine learning models:

- Logistic Regression: simple and interpretable baseline model
- Random Forest: ensemble method that also reports feature importance
- Support Vector Machine: non-linear classifier (RBF kernel)

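Key Insights

- Best Model: the model with the highest test accuracy is selected automatically; the cross-validation scores indicate how stable that accuracy is across folds
- Feature Importance: the dataset is synthetic and the column names are labels attached to generated features, so the importance ranking illustrates the technique rather than real churn drivers
- Prediction Capability: the selected model outputs a churn probability for each new customer, which is binned into Low, Medium, and High risk levels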

Business Applications

- Customer Retention: identify high-risk customers for targeted retention campaigns
- Resource Allocation: focus retention efforts on customers most likely to churn
- Strategy Development: use feature importance to understand key churn drivers

Next Steps

Model Improvement:
- Hyperparameter tuning
- Feature engineering
- Ensemble methods
Deployment:
- Model serialization
- Real-time prediction API
- Monitoring and retraining
Business Integration:
- Dashboard development
- Automated alerting
- A/B testing
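
As an illustration of the "Model serialization" step listed above, here is a minimal sketch that saves a fitted pipeline with `joblib` and loads it back. The stand-in dataset, the file name `churn_model.joblib`, and the decision to serialize scaler and model together as one pipeline are assumptions for this example:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and a fitted scaler + model pipeline.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=7, n_redundant=2, random_state=42
)
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe.fit(X_demo, y_demo)

# Save to a temporary directory and load it back (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "churn_model.joblib")
    joblib.dump(pipe, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(pipe.predict(X_demo), restored.predict(X_demo))
print("Restored model reproduces predictions:", same)
```

Saving the scaler and the classifier as a single object avoids the risk of applying a model to data scaled with a different scaler. Files produced this way should only be loaded from trusted sources and with a compatible scikit-learn version.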

This tutorial demonstrates a complete machine learning workflow. In production, consider more sophisticated validation techniques, model monitoring, and continuous learning approaches.