Analyzing Model Stability and Generalization under Distribution Shift in Real-World Machine Learning Applications

Authors

  • Muhammad Abrar Rayhan, Telkom University
  • Citra Feby Dermawaty Manik, Telkom University
  • Angga Prasetya Putra, Telkom University

DOI:

https://doi.org/10.59563/djtech.v6i2.367

Keywords:

Distribution Shift, Model Stability, Generalization, Gradient Boosting, UCI Adult Dataset

Abstract

Machine learning models deployed in operational settings rarely encounter data distributed identically to their training set. Shifts in population composition, measurement processes, and sampling frames routinely degrade performance, undermining both accuracy and trust. This study empirically examines model stability and generalization under controlled distribution shift using the UCI Adult/Census Income dataset (48,842 records, 14 features). Four representative classifiers (Logistic Regression, XGBoost, LightGBM, and CatBoost) were trained and evaluated across three scenarios: an in-distribution stratified random split; a demographic shift in which the model is trained on individuals under 40 years old and tested on those aged 40 and above; and a structural subpopulation shift in which the model is trained on non-degree holders and tested on degree holders. Contrary to the conventional expectation that distribution shift monotonically degrades performance, the F1-score results reveal a more nuanced picture: all four classifiers achieved higher F1-scores on the education-shifted test set than on the in-distribution baseline, with Logistic Regression gaining +0.145 F1 points. This counter-intuitive outcome is driven by the increased positive-class prior in the shifted target distributions. When stability is operationalized as the signed average F1 change (with rank 1 assigned to the smallest, i.e. most negative, value), Logistic Regression ranked first (average change −0.076), followed by CatBoost (−0.016), LightGBM (−0.013), and XGBoost (−0.013). Under the operationally meaningful absolute-change criterion, however, this ordering reverses and the gradient boosting models are the most stable. Accuracy adds a further contrast: Logistic Regression's accuracy fell by 16.4 percentage points under the age shift, whereas the gradient boosting models retained accuracy above 0.81. These findings demonstrate that single-metric stability evaluation is misleading and that shift robustness must be characterized through multiple complementary metrics.
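The two quantitative arguments in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The first part ranks the four models by the average F1 changes reported above, once by signed value and once by absolute value, to show how the ordering reverses. The second part assumes a pure prior shift in which a classifier's true-positive and false-positive rates stay fixed (an assumption made here for illustration, with hypothetical rates and priors), and shows that F1 then rises with the positive-class prior. The helper names `rank_by` and `f1_at_prior` are illustrative.

```python
def rank_by(scores, key):
    """Return model names ordered by key(score), smallest first (rank 1 first)."""
    return sorted(scores, key=lambda name: key(scores[name]))


# Signed average F1 change per model, as reported in the abstract.
avg_f1_change = {
    "Logistic Regression": -0.076,
    "CatBoost": -0.016,
    "LightGBM": -0.013,
    "XGBoost": -0.013,
}

# Rank 1 = most negative signed change (the abstract's first criterion).
signed_ranking = rank_by(avg_f1_change, key=lambda change: change)
# Rank 1 = smallest absolute change (the abstract's second criterion).
absolute_ranking = rank_by(avg_f1_change, key=abs)


def f1_at_prior(tpr, fpr, prior):
    """F1 of a classifier with fixed TPR and FPR at a given positive-class prior."""
    true_pos = tpr * prior
    false_pos = fpr * (1.0 - prior)
    precision = true_pos / (true_pos + false_pos)
    recall = tpr
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical operating point and priors, chosen only for illustration.
f1_low_prior = f1_at_prior(tpr=0.6, fpr=0.1, prior=0.24)
f1_high_prior = f1_at_prior(tpr=0.6, fpr=0.1, prior=0.50)

print("signed ranking:  ", signed_ranking)
print("absolute ranking:", absolute_ranking)
print(f"F1 at prior 0.24: {f1_low_prior:.3f}, at prior 0.50: {f1_high_prior:.3f}")
```

Because precision depends on the class prior while recall does not, a higher share of positives in the test set can raise F1 even when the classifier itself is unchanged. This is one reason to read F1 changes under shift alongside accuracy and other metrics.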

Published

2026-05-31

How to Cite

[1]
M. A. Rayhan, C. F. D. Manik, and A. P. Putra, “Analyzing Model Stability and Generalization under Distribution Shift in Real-World Machine Learning Applications”, djtech, vol. 6, no. 2, pp. 78–89, May 2026.

Issue

Dewantara Journal of Technology Volume 6 No 2