{"id":29645,"date":"2026-01-20T18:37:05","date_gmt":"2026-01-20T18:37:05","guid":{"rendered":"https:\/\/gtracademy.org\/?p=29645"},"modified":"2026-01-20T18:37:05","modified_gmt":"2026-01-20T18:37:05","slug":"handling-imbalanced-data-smote-class-weights-and-better-metrics","status":"publish","type":"post","link":"https:\/\/gtracademy.org\/staging\/handling-imbalanced-data-smote-class-weights-and-better-metrics\/","title":{"rendered":"Handling Imbalanced Data: SMOTE, Class Weights, and Better Metrics"},"content":{"rendered":"<p>Many real-world datasets are imbalanced, for example 99% &#8220;normal&#8221; transactions and 1% fraud. Plain accuracy is misleading here: a model that predicts &#8220;normal&#8221; for everything scores 99%. Here&#8217;s how to build models that work when classes aren&#8217;t equal.<\/p>\n<p><strong>Why imbalance breaks Machine Learning<\/strong><\/p>\n<p>Problems:<\/p>\n<ul>\n<li>Models can ignore the rare class and still reach 99% accuracy.<\/li>\n<li>The default 0.5 decision threshold favors the majority class.<\/li>\n<li>Aggregate metrics such as accuracy hide poor minority-class performance.<\/li>\n<\/ul>\n<p>Solution domains:\u00a0Resampling, cost\u2011sensitive learning, better metrics.<\/p>\n<p><strong>Method 1: Resampling strategies<\/strong><\/p>\n<p>Undersampling:\u00a0Remove majority samples \u2192 balanced classes, but less training data.<br \/>\nOversampling:\u00a0Duplicate minority samples \u2192 balanced classes, but a higher risk of overfitting to the repeated points.<\/p>\n<p>SMOTE (Synthetic Minority Oversampling Technique):<\/p>\n<ul>\n<li>For each minority sample, find its k nearest minority neighbors.<\/li>\n<li>Generate synthetic samples along the line segments joining it to those neighbors.<\/li>\n<li>Preserves local structure better than plain duplication.<\/li>\n<\/ul>\n<p>Apply resampling to the training split only; validation and test data should keep the original class ratio.<\/p>\n<p><strong>Method 2: Algorithm tweaks<\/strong><\/p>\n<p>Class weights:\u00a0Penalize errors on the minority class more heavily.<\/p>\n<p>sklearn: class_weight='balanced' (weights inversely proportional to class frequencies)<\/p>\n<p>XGBoost: scale_pos_weight = number of negative samples \/ number of positive samples<\/p>\n<p>Ensemble:\u00a0Undersample the majority class differently for each bagged or boosted model, then combine the models.<\/p>\n<p><strong>Method 3: Threshold Tuning + Metrics<\/strong><\/p>\n<p>Key metrics:<\/p>\n<ul>\n<li>Precision\/Recall trade\u2011off\u00a0(the PR curve 
is more informative than the ROC curve under imbalance).<\/li>\n<li>F1 score:\u00a0Harmonic mean of precision and recall; it is low whenever either one is low.<\/li>\n<li>AUC\u2011PR:\u00a0Area under the precision\u2011recall curve.<\/li>\n<\/ul>\n<p>Tune the decision threshold on a validation set to minimize business cost (FP vs FN).<\/p>\n<p><strong>Example: Fraud Detection<\/strong><\/p>\n<p>Dataset: 98% normal, 2% fraud.<\/p>\n<p>Baseline: Predict all normal \u2192 Accuracy 98%, Recall 0%<\/p>\n<p>Class weights \u2192 Recall 75%, Precision 60%<\/p>\n<p>SMOTE + threshold \u2192 Recall 85%, Precision 55%<\/p>\n<p>Pick based on cost, for example $100 per missed fraud (FN) vs $10 per false alarm (FP). At those costs, per 1,000 transactions (20 frauds), the baseline costs $2,000 (20 FN), class weights about $600 (5 FN, 10 FP), and SMOTE + threshold about $440 (3 FN, roughly 14 FP), so the higher\u2011recall model wins.<\/p>\n<p>Try this:\u00a0Grab a fraud\/credit dataset. Fit three models: baseline, class weights, SMOTE. Plot their PR curves and compare.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Many real-world datasets are imbalanced, for example 99% &#8220;normal&#8221; transactions and 1% fraud. Plain accuracy is misleading here: a model that predicts &#8220;normal&#8221; for everything scores 99%. Here&#8217;s how to build models that work when classes aren&#8217;t equal. Why imbalance breaks Machine Learning Problems: Models can ignore the rare class and still reach 99% accuracy. The default 0.5 decision threshold favors the majority class. 
Evaluation metrics hide poor minority performance. Solution&#8230;<\/p>\n","protected":false},"author":11,"featured_media":29647,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_kad_post_transparent":"default","_kad_post_title":"default","_kad_post_layout":"default","_kad_post_sidebar_id":"","_kad_post_content_style":"default","_kad_post_vertical_padding":"default","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[792,1427,1],"tags":[3792,3795,3794,2338,3791,3793,3790],"class_list":["post-29645","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-analytics","category-data-science","category-machine-learning","tag-algorithm-tweaks","tag-better-metrics","tag-class-weights","tag-machine-learning","tag-oversampling","tag-smote","tag-undersampling"],"acf":[],"_links":{"self":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/posts\/29645","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/comments?post=29645"}],"version-history":[{"count":0,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/posts\/29645\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/media\/29647"}],"wp:attachment":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/media?parent=29645"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/categories?post=29645"},{"taxonomy":"post_
tag","embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/tags?post=29645"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}