Stratified Split
Stratified split is a data sampling technique used in machine learning and statistics to divide a dataset into subsets, typically training and testing sets, while preserving the proportional distribution of target classes or categories. Each subset keeps approximately the same class balance as the original dataset, so a rare class is neither missing from the test set nor over-represented in it by chance. This matters most in classification tasks, where a purely random split of an imbalanced dataset can distort evaluation metrics.
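The mechanism can be illustrated with a minimal sketch in plain Python. The function name `stratified_split`, its parameters, and the toy "legit"/"fraud" labels are hypothetical and chosen only for illustration. The sketch groups sample indices by class, shuffles each group, and takes the same fraction from every group.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), sampling each class separately.

    Illustrative helper, not a library function.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class. Because of per-class
        # rounding, the total test size can differ slightly from the
        # requested fraction.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced dataset: 90% "legit", 10% "fraud".
labels = ["legit"] * 90 + ["fraud"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts)  # 72 legit, 8 fraud
print(test_counts)   # 18 legit, 2 fraud
```

Both subsets keep the original 9:1 ratio. In practice, a library routine is usually preferable to hand-written code; for example, scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose.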
Developers should use stratified split when working with imbalanced datasets in classification problems, such as fraud detection, medical diagnosis, or sentiment analysis, so that minority classes are represented in both the training and evaluation data. Stratification does not correct the imbalance itself or stop a model from favoring the majority class; it makes the evaluation representative. It is especially useful in cross-validation, where keeping class distributions consistent across folds reduces the variance of performance estimates and gives a more reliable picture of how the model will behave on unseen data.
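For the cross-validation case, the following sketch assumes scikit-learn is installed and uses its `StratifiedKFold` splitter. The toy data (90 samples of class 0, 10 of class 1) is hypothetical and only serves to show that every fold keeps the same class ratio.

```python
from collections import Counter

from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced dataset: 90 samples of class 0, 10 of class 1.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# Five folds; shuffling is optional and seeded here for reproducibility.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count the classes that land in each held-out fold.
    fold_counts.append(Counter(y[i] for i in test_idx))

for fold_number, counts in enumerate(fold_counts):
    print(fold_number, dict(counts))
```

With these class sizes, each held-out fold contains 18 samples of class 0 and 2 of class 1. A plain, unstratified k-fold split could by chance leave a fold with no minority samples, which would make metrics such as recall undefined or unstable for that fold.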