Machine Learning Tutorial
This notebook demonstrates a complete machine learning workflow from data preparation to model evaluation.
Objectives
- Understand the machine learning workflow
- Prepare data for machine learning
- Train and evaluate different models
- Interpret model results
- Make predictions on new data
1. Data Preparation
Let's start by importing libraries and creating a dataset for our machine learning task.
# Import essential libraries
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC
warnings.filterwarnings("ignore")
# Set random seed for reproducibility
np.random.seed(42)
print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print("Scikit-learn available: Yes")
# Create a synthetic dataset for classification
# This simulates a customer churn prediction problem
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42,
)
# Create feature names
feature_names = [
    "age",
    "income",
    "tenure_months",
    "monthly_charges",
    "total_charges",
    "support_tickets",
    "satisfaction_score",
    "usage_hours",
    "contract_length",
    "payment_method",
]
# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df["churn"] = y # 1 = customer churned, 0 = customer retained
# Make the data more realistic
df["age"] = np.clip(df["age"] * 10 + 45, 18, 80).astype(int)
df["income"] = np.clip(df["income"] * 15000 + 50000, 20000, 120000).astype(int)
df["tenure_months"] = np.clip(df["tenure_months"] * 12 + 24, 1, 72).astype(int)
df["monthly_charges"] = np.clip(df["monthly_charges"] * 20 + 50, 20, 150).round(2)
df["satisfaction_score"] = np.clip(df["satisfaction_score"] * 2 + 3, 1, 5).round(1)
print(f"Dataset created with {len(df)} customers")
print(f"Features: {len(feature_names)}")
print(f"Churn rate: {df['churn'].mean():.2%}")
2. Exploratory Data Analysis
Let's explore our dataset to understand the characteristics of customers who churn.
# Basic dataset information
print("Dataset Shape:", df.shape)
print("\nFirst 5 rows:")
print(df.head())
print("\nDataset Info:")
df.info()  # info() prints its own summary and returns None, so no print() is needed
print("\nTarget distribution:")
print(df["churn"].value_counts())
print(f"\nClass balance: {df['churn'].value_counts(normalize=True)}")
# Statistical summary
print("Statistical Summary:")
df.describe()
# Visualize the target distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Churn distribution
df["churn"].value_counts().plot(
    kind="bar", ax=axes[0], color=["lightblue", "lightcoral"]
)
axes[0].set_title("Customer Churn Distribution")
axes[0].set_xlabel("Churn (0=No, 1=Yes)")
axes[0].set_ylabel("Count")
axes[0].tick_params(axis="x", rotation=0)
# Churn percentage
churn_pct = df["churn"].value_counts(normalize=True)
axes[1].pie(
    churn_pct.values,
    labels=["Retained", "Churned"],
    autopct="%1.1f%%",
    colors=["lightblue", "lightcoral"],
)
axes[1].set_title("Customer Churn Percentage")
plt.tight_layout()
plt.show()
# Feature analysis by churn status
# Select key features for analysis
key_features = [
    "age",
    "income",
    "tenure_months",
    "monthly_charges",
    "satisfaction_score",
]
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()
for i, feature in enumerate(key_features):
    # Box plot by churn status
    df.boxplot(column=feature, by="churn", ax=axes[i])
    axes[i].set_title(f'{feature.replace("_", " ").title()} by Churn Status')
    axes[i].set_xlabel("Churn (0=No, 1=Yes)")
# Remove the empty subplot
axes[5].remove()
plt.tight_layout()
plt.show()
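3. Feature Engineering
Let's prepare our features for machine learning by separating the target, splitting the data into training and test sets, and scaling the features. All features are numeric, so no encoding step is needed.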
# Separate features and target
X = df.drop("churn", axis=1)
y = df["churn"]
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures: {list(X.columns)}")
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"\nTraining set churn rate: {y_train.mean():.2%}")
print(f"Test set churn rate: {y_test.mean():.2%}")
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrames for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns)
print("Features scaled successfully!")
print(f"\nScaled training features - Mean: {X_train_scaled.mean().mean():.4f}")
print(f"Scaled training features - Std: {X_train_scaled.std().mean():.4f}")
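A caveat worth knowing: when features are scaled once on the full training set and then passed to cross-validation, each validation fold has already influenced the scaling statistics. One common way to avoid this is to put the scaler and the model in a pipeline, so the scaler is refitted inside every fold. The sketch below is a minimal illustration on regenerated stand-in data; the names `X_demo`, `pipe`, and `pipe_cv_scores` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data generated with the same settings as the tutorial's dataset
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline refits StandardScaler on the training part of each CV fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
pipe_cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="accuracy")

# Fit on the whole training set, then score once on the held-out test set
pipe.fit(X_tr, y_tr)
pipe_test_accuracy = pipe.score(X_te, y_te)

print(f"CV accuracy: {pipe_cv_scores.mean():.4f} (+/- {pipe_cv_scores.std() * 2:.4f})")
print(f"Test accuracy: {pipe_test_accuracy:.4f}")
```

On a small, well-behaved dataset like this one the difference from scaling up front is usually tiny, but the pipeline form also keeps the scaler and the model together for later prediction.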
4. Model Training
Let's train multiple machine learning models and compare their performance.
# Initialize models
models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # probability=True is required for predict_proba, which the prediction step uses
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
}
# Train models and store results
model_results = {}
trained_models = {}
print("Training models...\n")
for name, model in models.items():
    print(f"Training {name}...")
    # Train the model
    model.fit(X_train_scaled, y_train)
    trained_models[name] = model
    # Cross-validation score
    cv_scores = cross_val_score(
        model, X_train_scaled, y_train, cv=5, scoring="accuracy"
    )
    # Predictions
    y_pred = model.predict(X_test_scaled)
    # Store results
    model_results[name] = {
        "cv_mean": cv_scores.mean(),
        "cv_std": cv_scores.std(),
        "test_accuracy": accuracy_score(y_test, y_pred),
        "predictions": y_pred,
    }
    print(f" CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    print(f" Test Accuracy: {accuracy_score(y_test, y_pred):.4f}\n")
print("All models trained successfully!")
5. Model Evaluation
Let's evaluate our models using various metrics and visualizations.
# Compare model performance
results_df = pd.DataFrame(model_results).T
results_df = results_df[["cv_mean", "cv_std", "test_accuracy"]]
results_df.columns = ["CV_Mean", "CV_Std", "Test_Accuracy"]
print("Model Performance Comparison:")
print(results_df.round(4))
# Visualize model comparison (bars: test accuracy; error bars: cross-validation std)
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
x_pos = np.arange(len(results_df))
bars = ax.bar(
    x_pos,
    results_df["Test_Accuracy"],
    yerr=results_df["CV_Std"],
    alpha=0.7,
    color=["lightblue", "lightgreen", "lightcoral"],
    capsize=5,
)
ax.set_xlabel("Models")
ax.set_ylabel("Accuracy")
ax.set_title("Model Performance Comparison")
ax.set_xticks(x_pos)
ax.set_xticklabels(results_df.index, rotation=45)
ax.set_ylim(0, 1)
ax.grid(True, alpha=0.3)
# Add value labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    ax.text(
        bar.get_x() + bar.get_width() / 2.0,
        height + 0.01,
        f"{height:.3f}",
        ha="center",
        va="bottom",
    )
plt.tight_layout()
plt.show()
# Detailed evaluation for the best model
best_model_name = results_df["Test_Accuracy"].idxmax()
best_model = trained_models[best_model_name]
best_predictions = model_results[best_model_name]["predictions"]
print(f"Best Model: {best_model_name}")
print(f"Test Accuracy: {model_results[best_model_name]['test_accuracy']:.4f}")
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, best_predictions))
# Confusion matrix
cm = confusion_matrix(y_test, best_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=["No Churn", "Churn"],
    yticklabels=["No Churn", "Churn"],
)
plt.title(f"Confusion Matrix - {best_model_name}")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
# Calculate additional metrics
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * (precision * recall) / (precision + recall)
print(f"\nAdditional Metrics:")
print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"F1-Score: {f1:.4f}")
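The hand-computed metrics above can be cross-checked against scikit-learn's own metric functions. The sketch below does this on a small set of hand-made labels (illustrative values only, not results from this notebook); `y_true` and `y_hat` are hypothetical names. It relies on the fact that specificity is the recall of the negative class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hand-made labels for illustration only (not output from the tutorial's models)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * (precision * recall) / (precision + recall)

# The manual formulas should agree with scikit-learn's implementations
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(f1, f1_score(y_true, y_hat))
# Specificity is the recall of the negative class (label 0)
assert np.isclose(specificity, recall_score(y_true, y_hat, pos_label=0))

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Precision={precision:.2f}, Recall={recall:.2f}, "
      f"Specificity={specificity:.2f}, F1={f1:.2f}")
```

With these labels there is one false positive and one false negative among five examples of each class, so all four metrics come out to 0.80.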
# Feature importance (for Random Forest)
if best_model_name == "Random Forest":
    feature_importance = pd.DataFrame(
        {"feature": X.columns, "importance": best_model.feature_importances_}
    ).sort_values("importance", ascending=False)
    plt.figure(figsize=(10, 6))
    sns.barplot(data=feature_importance, x="importance", y="feature", palette="viridis")
    plt.title("Feature Importance - Random Forest")
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()
    print("Top 5 Most Important Features:")
    print(feature_importance.head())
else:
    print(f"Feature importance not available for {best_model_name}")
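When the best model has no `feature_importances_` attribute (an RBF-kernel SVM, for example), permutation importance is one model-agnostic alternative: it measures how much a score drops when a single feature's values are shuffled. The sketch below is a minimal illustration using `sklearn.inspection.permutation_importance` on regenerated stand-in data; the generic names `feature_0` to `feature_9` and the variables `svm_pipe` and `perm_df` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data generated with the same settings as the tutorial's dataset
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42,
)
demo_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# An SVM exposes no feature_importances_, so it is a natural candidate here
svm_pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42))
svm_pipe.fit(X_tr, y_tr)

# Shuffle each feature 10 times on the test set and record the accuracy drop
perm = permutation_importance(
    svm_pipe, X_te, y_te, n_repeats=10, random_state=42, scoring="accuracy"
)

perm_df = pd.DataFrame(
    {
        "feature": demo_names,
        "importance_mean": perm.importances_mean,
        "importance_std": perm.importances_std,
    }
).sort_values("importance_mean", ascending=False)

print(perm_df.head())
```

Values near zero (or slightly negative) suggest the model barely relies on that feature. Correlated features can share importance between them, so the ranking is best read as a rough guide.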
6. Predictions
Let's use our best model to make predictions on new data.
# Create new customer data for prediction
new_customers = pd.DataFrame(
    {
        "age": [25, 45, 65],
        "income": [35000, 75000, 90000],
        "tenure_months": [6, 24, 48],
        "monthly_charges": [75.50, 55.20, 89.90],
        "total_charges": [0.5, 1.2, -0.3],
        "support_tickets": [2.1, 0.1, -1.5],
        "satisfaction_score": [2.5, 4.2, 3.8],
        "usage_hours": [1.2, -0.5, 0.8],
        "contract_length": [-0.8, 0.2, 1.1],
        "payment_method": [0.3, -0.9, 0.6],
    }
)
print("New customers to predict:")
print(new_customers)
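# Scale the new data using the same scaler, keeping the column names so they
# match the feature names the models were trained on
new_customers_scaled = pd.DataFrame(
    scaler.transform(new_customers), columns=new_customers.columns
)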
# Make predictions
predictions = best_model.predict(new_customers_scaled)
prediction_proba = best_model.predict_proba(new_customers_scaled)
# Create results DataFrame
prediction_results = new_customers.copy()
prediction_results["predicted_churn"] = predictions
prediction_results["churn_probability"] = prediction_proba[:, 1]
prediction_results["risk_level"] = pd.cut(
    prediction_proba[:, 1],
    bins=[0, 0.3, 0.7, 1.0],
    labels=["Low", "Medium", "High"],
    include_lowest=True,  # otherwise a probability of exactly 0 maps to NaN
)
print("\nPrediction Results:")
display_cols = [
    "age",
    "income",
    "tenure_months",
    "satisfaction_score",
    "predicted_churn",
    "churn_probability",
    "risk_level",
]
print(prediction_results[display_cols])
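A note on how `pd.cut` treats the bin edges used for the risk levels: bins are right-inclusive by default, so with edges `[0, 0.3, 0.7, 1.0]` a probability of exactly 0.3 is "Low" and exactly 0.7 is "Medium". The left edge of the first bin is excluded unless `include_lowest=True` is passed. The sketch below compares both behaviors on a few made-up probabilities (`demo_proba` is illustrative, not model output).

```python
import numpy as np
import pandas as pd

# Made-up probabilities chosen to sit on or near the bin edges
demo_proba = np.array([0.0, 0.3, 0.31, 0.7, 0.95])
bins = [0, 0.3, 0.7, 1.0]
labels = ["Low", "Medium", "High"]

# Default: intervals are (0, 0.3], (0.3, 0.7], (0.7, 1.0], so 0.0 falls outside
default_cut = pd.cut(demo_proba, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(demo_proba, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame(
    {
        "probability": demo_proba,
        "default": default_cut,
        "include_lowest": inclusive_cut,
    }
)
print(comparison)
```

This matters mostly for tree ensembles such as Random Forest, which can return a probability of exactly 0 when no tree votes for the positive class.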
# Visualize prediction results
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Churn probability by customer
customer_ids = [f"Customer {i+1}" for i in range(len(new_customers))]
colors = ["green" if p < 0.5 else "red" for p in prediction_proba[:, 1]]
bars = axes[0].bar(customer_ids, prediction_proba[:, 1], color=colors, alpha=0.7)
axes[0].set_title("Churn Probability by Customer")
axes[0].set_xlabel("Customer")
axes[0].set_ylabel("Churn Probability")
axes[0].set_ylim(0, 1)
axes[0].axhline(
    y=0.5, color="black", linestyle="--", alpha=0.7, label="Decision Threshold"
)
axes[0].legend()
# Add probability labels
for i, bar in enumerate(bars):
    height = bar.get_height()
    axes[0].text(
        bar.get_x() + bar.get_width() / 2.0,
        height + 0.02,
        f"{height:.2f}",
        ha="center",
        va="bottom",
    )
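# Risk level distribution (fixed Low/Medium/High order so colors match labels)
risk_colors = {"Low": "lightgreen", "Medium": "orange", "High": "lightcoral"}
risk_counts = (
    prediction_results["risk_level"]
    .value_counts()
    .reindex(list(risk_colors), fill_value=0)
)
risk_counts = risk_counts[risk_counts > 0]  # skip empty categories
axes[1].pie(
    risk_counts.values,
    labels=risk_counts.index,
    autopct="%1.0f%%",
    colors=[risk_colors[level] for level in risk_counts.index],
)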
axes[1].set_title("Risk Level Distribution")
plt.tight_layout()
plt.show()
7. Conclusions
Model Performance Summary
We successfully trained and compared three different machine learning models:
- Logistic Regression: Simple and interpretable baseline model
- Random Forest: Ensemble method with feature importance
- Support Vector Machine: Non-linear classification approach
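Key Insights
- Best Model: The model with the highest test accuracy is selected automatically; see the comparison table in Section 5 for the actual scores.
- Feature Importance: When Random Forest is the best model, its feature importances rank the predictors. Because the data is synthetic and the column names were assigned arbitrarily, the ranking illustrates the technique rather than real churn drivers.
- Prediction Capability: The model assigns each new customer a churn probability and a risk level.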
Business Applications
- Customer Retention: Identify high-risk customers for targeted retention campaigns
- Resource Allocation: Focus retention efforts on customers most likely to churn
- Strategy Development: Use feature importance to understand key churn drivers
Next Steps
Model Improvement:
- Hyperparameter tuning
- Feature engineering
- Ensemble methods
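As a starting point for hyperparameter tuning, here is a minimal sketch using `GridSearchCV` on a Random Forest. The grid values are arbitrary illustrations rather than recommendations, the data is regenerated stand-in data, and `grid` and `tuned_test_accuracy` are hypothetical names. Random Forest is used because tree-based models do not need feature scaling, which keeps the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data generated with the same settings as the tutorial's dataset
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# A deliberately small grid (illustrative values) to keep the search quick
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)  # tries every combination with 3-fold cross-validation

# After the search, grid is refitted on the full training set with the best settings
tuned_test_accuracy = grid.score(X_te, y_te)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")
print(f"Test accuracy: {tuned_test_accuracy:.4f}")
```

The test set is only touched once, after the search has finished, so the reported test accuracy is not used to pick the hyperparameters.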
Deployment:
- Model serialization
- Real-time prediction API
- Monitoring and retraining
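For model serialization, one common option is `joblib`, which is installed alongside scikit-learn. The sketch below saves a fitted pipeline (scaler plus model) to a temporary directory and loads it back; the file name `churn_pipeline.joblib` and the variable names are hypothetical, and the data is regenerated stand-in data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data generated with the same settings as the tutorial's dataset
X_demo, y_demo = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=7,
    n_redundant=2,
    n_clusters_per_class=1,
    random_state=42,
)

# Saving scaler and model together means new data is scaled consistently
churn_pipe = make_pipeline(StandardScaler(), LogisticRegression(random_state=42))
churn_pipe.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "churn_pipeline.joblib")  # hypothetical name
    joblib.dump(churn_pipe, model_path)
    restored_pipe = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions exactly
same_predictions = np.array_equal(
    churn_pipe.predict(X_demo), restored_pipe.predict(X_demo)
)
print(f"Predictions identical after reload: {same_predictions}")
```

Files written this way are pickle-based, so they should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.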
Business Integration:
- Dashboard development
- Automated alerting
- A/B testing
This tutorial demonstrates a complete machine learning workflow. In production, consider more sophisticated validation techniques, model monitoring, and continuous learning approaches.