Analyzing Model Stability and Generalization under Distribution Shift in Real-World Machine Learning Applications
DOI: https://doi.org/10.59563/djtech.v6i2.367
Keywords: Distribution Shift, Model Stability, Generalization, Gradient Boosting, UCI Adult Dataset
Abstract
Machine learning models deployed in operational settings rarely encounter data that is identically distributed to their training set. Shifts in population composition, measurement processes, and sampling frames routinely degrade performance, undermining both accuracy and trust. This study empirically examines model stability and generalization under controlled distribution shift using the UCI Adult/Census Income dataset (48,842 records, 14 features). Four representative classifiers (Logistic Regression, XGBoost, LightGBM, and CatBoost) were trained and evaluated across three scenarios: an in-distribution stratified random split; a demographic shift, in which each model is trained on individuals under 40 years old and tested on those aged 40 and above; and a structural subpopulation shift, in which each model is trained on non-degree holders and tested on degree holders. Contrary to the conventional expectation that distribution shift monotonically degrades performance, the F1-score results reveal a more nuanced picture: all four classifiers achieved higher F1-scores on the education-shifted test set than on the in-distribution baseline, with Logistic Regression gaining +0.145 F1 points. This counter-intuitive outcome is driven by the higher positive-class prior in the shifted target distributions. When stability is operationalized as the signed average F1 change (with rank 1 assigned to the smallest, i.e. most negative, value), Logistic Regression ranked first (average change −0.076), followed by CatBoost (−0.016), LightGBM (−0.013), and XGBoost (−0.013). Under the operationally more meaningful absolute-change criterion, however, this ordering reverses and the gradient boosting models are the most stable. Accuracy also contradicts the signed ranking: Logistic Regression's accuracy fell by 16.4 percentage points under the age shift, whereas the gradient boosting models retained accuracy above 0.81. These findings demonstrate that single-metric stability evaluation is misleading and that shift robustness must be characterized through multiple complementary metrics.
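The reversal between the signed and absolute stability criteria can be illustrated with a minimal Python sketch. It assumes that the F1 change is computed as shifted score minus in-distribution score and that the average is taken over the shift scenarios; the abstract does not state either detail explicitly. The helper names (`f1_changes`, `rank_models`), the model labels, and all scores are hypothetical toy values chosen for illustration and are not results from the study.

```python
from statistics import mean


def f1_changes(scores):
    """Signed F1 change per shift scenario: shifted score minus baseline.

    `scores` maps scenario name -> F1 and must contain a "baseline" key.
    """
    baseline = scores["baseline"]
    return {name: f1 - baseline for name, f1 in scores.items() if name != "baseline"}


def rank_models(changes_by_model, criterion):
    """Return model names ordered from rank 1 downward.

    criterion="signed":   smallest (most negative) mean signed change first,
                          mirroring the convention described in the abstract.
    criterion="absolute": smallest mean absolute change first.
    """
    if criterion == "signed":
        def key(model):
            return mean(changes_by_model[model].values())
    elif criterion == "absolute":
        def key(model):
            return mean(abs(c) for c in changes_by_model[model].values())
    else:
        raise ValueError(f"unknown criterion: {criterion!r}")
    return sorted(changes_by_model, key=key)


# Hypothetical toy F1 scores (NOT taken from the study).
toy_scores = {
    # Large drop under one shift, large gain under the other.
    "model_a": {"baseline": 0.60, "age_shift": 0.30, "education_shift": 0.75},
    # Nearly unchanged under both shifts.
    "model_b": {"baseline": 0.70, "age_shift": 0.68, "education_shift": 0.70},
}

changes = {model: f1_changes(s) for model, s in toy_scores.items()}
signed_order = rank_models(changes, "signed")
absolute_order = rank_models(changes, "absolute")

print("signed ranking:  ", signed_order)
print("absolute ranking:", absolute_order)
```

In this toy case the model with the largest swings is placed first by the signed criterion, because its large drop dominates the signed mean, while the absolute criterion places the nearly unchanged model first. The signed mean also lets a gain under one shift partly cancel a loss under another, which is one reason a single signed summary may misrepresent stability.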







