Many real datasets are imbalanced, for example 99% “normal” transactions and 1% fraud. Accuracy alone is misleading here: a model that predicts “normal” for everything scores 99% while catching no fraud at all. Here’s how to build models that work when classes aren’t equal.
Why imbalance breaks machine learning
Problems:
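- Models learn to ignore the rare class, because always predicting the majority already gives 99% accuracy.
- A default decision threshold of 0.5 favors the majority class, since predicted probabilities for the rare class tend to be low.
- Aggregate metrics such as accuracy hide poor minority-class performance.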
Three families of fixes: resampling, cost‑sensitive learning, and better metrics with threshold tuning.
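To see the accuracy problem concretely, here is a minimal sketch in plain Python. The labels are a made-up toy set with the 99%/1% split from the intro, and the "model" is just a constant prediction.

```python
# Hypothetical toy labels: 990 normal (0) and 10 fraud (1), a 99%/1% split.
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of actual fraud cases that were flagged.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.99
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```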
Method 1: Resampling strategies
Undersampling: Remove majority samples → balanced classes, but less training data.
Oversampling: Duplicate minority samples → balanced classes, but a risk of overfitting to the repeated points.
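Both random strategies fit in a few lines. The sketch below uses plain Python and assumes binary labels with 0 as the majority class; the helper names `random_undersample` and `random_oversample` are made up for illustration and are not library functions.

```python
import random

def random_undersample(X, y, majority=0, seed=0):
    """Keep all minority rows plus an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    keep = rng.sample(maj, len(mino)) + mino
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

def random_oversample(X, y, majority=0, seed=0):
    """Keep all rows and add duplicated minority rows until the classes match."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority]
    mino = [i for i, label in enumerate(y) if label != majority]
    extra = rng.choices(mino, k=len(maj) - len(mino))  # sampling with replacement
    keep = maj + mino + extra
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 95 majority rows and 5 minority rows, one feature each.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_u, y_u = random_undersample(X, y)
X_o, y_o = random_oversample(X, y)
print(len(y_u), sum(y_u))  # 10 5   (10 rows, 5 of them minority)
print(len(y_o), sum(y_o))  # 190 95 (190 rows, 95 of them minority)
```

Resample only the training split. Validation and test data should keep the original class ratio, otherwise the reported metrics will not reflect production.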
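SMOTE (Synthetic Minority Over-sampling Technique):
- For each minority sample, find its k nearest minority neighbors.
- Generate synthetic samples at random points along the line segments joining the sample to those neighbors.
- Preserves local structure better than plain duplication.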
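The interpolation idea can be sketched in plain Python. This is a simplified illustration, not a reference implementation: `smote_sketch` is a hypothetical helper, it uses brute-force Euclidean neighbors, and it picks base samples at random rather than cycling through all of them. In practice a library such as imbalanced-learn (its `SMOTE` class) would likely be a better choice.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        # Pick a minority sample at random.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbors (brute force, Euclidean distance).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbor = minority[rng.choice(others[:k])]
        # Step a random fraction of the way along the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Hypothetical 2-D minority samples.
minority = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8], [1.8, 1.6], [2.2, 1.4]]
new_points = smote_sketch(minority, n_new=4, k=3)
for point in new_points:
    print([round(v, 3) for v in point])
```

Every synthetic point lies between two existing minority points, so the new samples fill in the minority region instead of stacking copies on the same coordinates.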
Method 2: Algorithm tweaks
Class weights: Penalize minority-class errors more.
sklearn: class_weight='balanced'
XGBoost: scale_pos_weight = neg/pos ratio
Ensembles: Train each bagged or boosted learner on a different undersampled split of the majority class.
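To make the weights less of a black box, the sketch below computes them by hand on made-up labels. It uses the heuristic scikit-learn documents for the 'balanced' option, n_samples / (n_classes * count of that class), and the negatives-to-positives ratio mentioned above for `scale_pos_weight`.

```python
from collections import Counter

# Hypothetical training labels: 980 normal (0) and 20 fraud (1).
y_train = [0] * 980 + [1] * 20
counts = Counter(y_train)
n_samples = len(y_train)
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Ratio used for XGBoost's scale_pos_weight: negatives / positives.
scale_pos_weight = counts[0] / counts[1]

print(class_weight)      # roughly {0: 0.51, 1: 25.0}
print(scale_pos_weight)  # 49.0

# With these weights, both classes carry the same total weight in the loss.
print(counts[0] * class_weight[0], counts[1] * class_weight[1])  # about 500 each
```

A dict like `class_weight` can be passed to scikit-learn estimators that accept a `class_weight` argument, which is handy when the 'balanced' default needs manual adjustment.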
Method 3: Threshold tuning + metrics
Key metrics:
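- Precision/recall trade‑off: the PR curve is more informative than ROC under heavy imbalance, because ROC’s false positive rate is diluted by the large number of true negatives.
- F1 score: Harmonic mean of precision and recall, so it is low whenever either one is low.
- AUC‑PR: Area under the precision‑recall curve.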
Tune the decision threshold on a validation set to minimize business cost (cost of false positives vs. false negatives).
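A cost-based threshold sweep might look like the sketch below. The validation labels and scores are invented for illustration, and the costs are assumed to be $100 per false negative and $10 per false positive, the same figures the fraud example uses.

```python
def confusion(y_true, scores, threshold):
    """Count TP, FP, FN when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, scores):
        pred = s >= threshold
        if pred and t == 1:
            tp += 1
        elif pred and t == 0:
            fp += 1
        elif not pred and t == 1:
            fn += 1
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical validation labels and model scores (not from a real model).
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35,
          0.40, 0.42, 0.45, 0.55, 0.60, 0.70, 0.90]

COST_FN, COST_FP = 100, 10  # assumed business costs in dollars

# Sweep thresholds 0.05, 0.10, ..., 0.95 and keep the cheapest one.
best_threshold, best_cost = None, None
for threshold in [i / 20 for i in range(1, 20)]:
    tp, fp, fn = confusion(y_val, scores, threshold)
    cost = COST_FN * fn + COST_FP * fp
    if best_cost is None or cost < best_cost:
        best_threshold, best_cost = threshold, cost

for threshold in (0.5, best_threshold):
    tp, fp, fn = confusion(y_val, scores, threshold)
    p, r, f1 = precision_recall_f1(tp, fp, fn)
    cost = COST_FN * fn + COST_FP * fp
    print(f"t={threshold:.2f}  P={p:.2f}  R={r:.2f}  F1={f1:.2f}  cost=${cost}")
# t=0.50  P=0.75  R=0.75  F1=0.75  cost=$110
# t=0.40  P=0.57  R=1.00  F1=0.73  cost=$30
```

On this toy data the default 0.5 threshold has the better F1, but 0.40 is far cheaper because missing a fraud costs ten times more than a false alarm. For full curves, scikit-learn's `precision_recall_curve` and `average_precision_score` cover the PR curve and a common AUC‑PR estimate.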
Example: Fraud detection
Dataset: 98% normal, 2% fraud.
Baseline: Predict all normal → Accuracy 98%, Recall 0%
Class weights → Recall 75%, Precision 60%
SMOTE + threshold → Recall 85%, Precision 55%
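Pick based on cost: if a missed fraud (FN) costs $100 and a false alarm (FP) costs $10, the extra recall from SMOTE + threshold outweighs its lower precision, so it is the cheaper option here.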
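The comparison can be checked with a little arithmetic. The sketch below converts each option's recall and precision into expected FN and FP counts and then into dollars. The batch size of 10,000 transactions is an assumption made only for illustration, and `expected_cost` is a hypothetical helper.

```python
def expected_cost(n_total, fraud_rate, recall, precision, cost_fn=100, cost_fp=10):
    """Turn recall/precision into expected FN and FP counts, then into dollars."""
    n_fraud = n_total * fraud_rate
    tp = recall * n_fraud
    fn = n_fraud - tp
    # precision = tp / (tp + fp)  =>  fp = tp * (1 - precision) / precision
    fp = tp * (1 - precision) / precision if precision > 0 else 0.0
    return cost_fn * fn + cost_fp * fp

N = 10_000  # assumed batch size, for illustration only

# (recall, precision) pairs taken from the example above.
options = {
    "baseline (all normal)": (0.00, 0.00),
    "class weights": (0.75, 0.60),
    "SMOTE + threshold": (0.85, 0.55),
}
costs = {name: expected_cost(N, 0.02, r, p) for name, (r, p) in options.items()}
for name, cost in costs.items():
    print(f"{name:24s} ${cost:,.0f}")
# baseline is about $20,000, class weights about $6,000,
# SMOTE + threshold about $4,391
```

With a different cost ratio the ranking can flip: if false alarms were as expensive as missed fraud, the higher-precision option would look better.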
Try this: Grab a fraud/credit dataset. Fit 3 models: baseline, class weights, SMOTE. Plot PR curves.

