{"id":25408,"date":"2025-12-11T09:22:57","date_gmt":"2025-12-11T09:22:57","guid":{"rendered":"https:\/\/gtracademy.org\/?p=25408"},"modified":"2025-12-12T05:54:29","modified_gmt":"2025-12-12T05:54:29","slug":"train-validation-test-splits-and-data-leakage-in-practice","status":"publish","type":"post","link":"https:\/\/gtracademy.org\/staging\/train-validation-test-splits-and-data-leakage-in-practice\/","title":{"rendered":"Train\/Validation\/Test Splits and Data Leakage in Practice"},"content":{"rendered":"<p><img fetchpriority=\"high\" decoding=\"async\" class=\"alignnone size-medium wp-image-25465\" src=\"https:\/\/gtracademy.org\/wp-content\/uploads\/2025\/12\/GTR_Test_Validate_Train_2025-300x168.png\" alt=\"\" width=\"300\" height=\"168\" srcset=\"https:\/\/gtracademy.org\/staging\/wp-content\/uploads\/2025\/12\/GTR_Test_Validate_Train_2025-300x168.png 300w, https:\/\/gtracademy.org\/staging\/wp-content\/uploads\/2025\/12\/GTR_Test_Validate_Train_2025-768x429.png 768w, https:\/\/gtracademy.org\/staging\/wp-content\/uploads\/2025\/12\/GTR_Test_Validate_Train_2025.png 800w\" sizes=\"(max-width: 300px) 100vw, 300px\" \/>A machine learning model&#8217;s effectiveness is heavily reliant on how well its evaluation is conducted. Numerous &#8220;excellent&#8221; models fail in production because of inadequate data splitting and unnoticed data leakage. This article outlines ways to organize the train\/validation\/test splits and illustrates what data leakage looks like in actual projects.<\/p>\n<p><strong>Why splits matter<\/strong><\/p>\n<p>If you train and test your model on the same dataset, you get an overly optimistic estimate of its accuracy, one that usually does not hold up on real-world data. 
Splits prevent this by setting aside portions of the data that the model is not allowed to see during training.<\/p>\n<ul>\n<li><strong>Training set:<\/strong> Used to fit the parameters of the model.<\/li>\n<li><strong>Validation set:<\/strong> Used to tune hyperparameters, compare models, and make design choices.<\/li>\n<li><strong>Test set:<\/strong> Used once, at the end, to estimate true performance on unseen data.<\/li>\n<\/ul>\n<p>A typical starting point is 70\u201380% train, 10\u201315% validation, and 10\u201315% test, adjusted according to the size of the dataset.<\/p>\n<p><strong>Practical splitting strategies<\/strong><\/p>\n<p>How you split the data can matter more than the exact percentages.<\/p>\n<ul>\n<li><strong>Random split:<\/strong> Suitable for many tabular problems where rows are independent and identically distributed.<\/li>\n<li><strong>Stratified split:<\/strong> Keeps class proportions approximately equal across splits, which matters for imbalanced classification (e.g., fraud vs non-fraud).<\/li>\n<li><strong>Time-based split:<\/strong> For time series or any temporal data, always train on the past and validate\/test on the future to simulate actual deployment.<\/li>\n<li><strong>Grouped split:<\/strong> When multiple rows correspond to the same entity (user, patient, device), make sure each entity appears in only one split so that you do not &#8220;peek&#8221; at the same entity in both train and test.<\/li>\n<\/ul>\n<p>For small datasets, or when many modelling decisions have to be made, k-fold cross-validation on the training data is often used instead of a single validation split to obtain more reliable estimates.<\/p>\n<p><strong>What is data leakage?<\/strong><\/p>\n<p>Data leakage occurs when information from outside the training data, most often from the validation or test sets, unintentionally makes its way into 
model training. The result is overly optimistic validation\/test scores that are far from real-world performance.<\/p>\n<p>Major forms include:<\/p>\n<ul>\n<li><strong>Target leakage:<\/strong> Features carry direct or indirect information about the target that would not be available at prediction time.<\/li>\n<li><strong>Train\u2013test contamination:<\/strong> Validation or test data is used for preprocessing, feature engineering, or model fitting.<\/li>\n<\/ul>\n<p>Leakage is among the most frequent and serious errors made in practical ML pipelines.<\/p>\n<p><strong>Concrete leakage examples<\/strong><\/p>\n<p>The following scenarios each show a distinct &#8220;wrong vs right&#8221; pattern.<\/p>\n<ol>\n<li>Preprocessing on the whole dataset<\/li>\n<\/ol>\n<ul>\n<li><strong>Wrong:<\/strong> You scale, encode, or impute using the entire dataset and only then split.<\/li>\n<li><strong>Why it leaks:<\/strong> The fitted means, standard deviations, or category frequencies now contain information from the validation\/test data.<\/li>\n<li><strong>Right:<\/strong> Always split first. Fit the preprocessing only on the training set, then apply those fitted transformers to validation and test.<\/li>\n<\/ul>\n<ol start=\"2\">\n<li>Using future information in features<\/li>\n<\/ol>\n<ul>\n<li><strong>Wrong:<\/strong> To predict churn at the beginning of January, you use &#8220;total tickets in January&#8221; as a feature.<\/li>\n<li><strong>Why it leaks:<\/strong> That information is not available at prediction time, so the model is effectively cheating with future data.<\/li>\n<li><strong>Right:<\/strong> Restrict feature construction to data available up to the prediction time, and split chronologically.<\/li>\n<\/ul>\n<ol start=\"3\">\n<li>Entity split mistakes<\/li>\n<\/ol>\n<ul>\n<li><strong>Wrong:<\/strong> A driver-behavior 
model in which the same driver appears in both train and test.<\/li>\n<li><strong>Why it leaks:<\/strong> The model learns driver-specific patterns, which results in artificially high test performance.<\/li>\n<li><strong>Right:<\/strong> Use a grouped or hash-based strategy to split by driver ID so that each driver is in only one split.<\/li>\n<\/ul>\n<ol start=\"4\">\n<li>Target encoding with leakage<\/li>\n<\/ol>\n<ul>\n<li><strong>Wrong:<\/strong> Calculate the mean target per category using the entire dataset before splitting, or without proper cross-validation.<\/li>\n<li><strong>Why it leaks:<\/strong> The encodings indirectly include the labels of the validation\/test rows.<\/li>\n<li><strong>Right:<\/strong> Compute target encodings within cross-validation folds, or using only training data, and then apply them to validation\/test.<\/li>\n<\/ul>\n<p><strong>A safe pipeline pattern<\/strong><\/p>\n<p>A pattern that is very difficult to get wrong:<\/p>\n<ol>\n<li><strong>Split the data first:<\/strong> into train and test (or train\/validation\/test, or train plus k-fold CV on the training set).<\/li>\n<li><strong>Create a pipeline object comprising<\/strong> preprocessing + the model (for example, in scikit-learn).<\/li>\n<li><strong>Fit the pipeline only on training data\/folds;<\/strong> it will then fit the preprocessing steps correctly.<\/li>\n<li><strong>Tune hyperparameters<\/strong> on the validation set or by cross-validation.<\/li>\n<li>After the final decision, evaluate performance exactly once on the test set, which has been kept completely aside.<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>A machine learning model&#8217;s effectiveness is heavily reliant on how well its evaluation is conducted. Numerous &#8220;excellent&#8221; models fail in production because of inadequate data splitting and unnoticed data leakage. 
This article outlines ways to organize the train\/validation\/test splits and illustrates what data leakage looks like in actual projects. Why splits matter If you train&#8230;<\/p>\n","protected":false},"author":11,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kad_post_transparent":"default","_kad_post_title":"default","_kad_post_layout":"default","_kad_post_sidebar_id":"","_kad_post_content_style":"default","_kad_post_vertical_padding":"default","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[1],"tags":[2667,1448,2668,2661,2666,2338,2665,2662,2663,2664],"class_list":["post-25408","post","type-post","status-publish","format-standard","hentry","category-machine-learning","tag-ai","tag-artificial-intelligence-and-data-science","tag-data-leakage","tag-data-validation","tag-dl","tag-machine-learning","tag-ml","tag-testing","tag-training","tag-validation"],"_links":{"self":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/posts\/25408","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/users\/11"}],"replies":[{"embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/comments?post=25408"}],"version-history":[{"count":0,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/posts\/25408\/revisions"}],"wp:attachment":[{"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/media?parent=25408"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/gtracademy.org\/staging\/wp-json\/wp\/v2\/categories?post=25408"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/gtracademy.org\/staging
\/wp-json\/wp\/v2\/tags?post=25408"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}