Stacking Ensemble Learning for University Student Dropout Prediction
DOI: https://doi.org/10.63158/journalisi.v8i1.1403
Keywords: Stacking Ensemble Learning, Student Dropout Prediction, STEM Education, SMOTE–Tomek Links, Educational Data Mining
Abstract
Student dropout in STEM programs remains a persistent challenge for higher education institutions, reducing educational quality, weakening retention outcomes, and increasing inefficiencies in resource utilization. This study develops an interpretable Stacking Ensemble Learning approach to predict STEM student dropout risk and identify key academic and socioeconomic determinants that can support data-driven early intervention. Following the CRISP-DM framework, we analyze 3,630 student records from the UCI Machine Learning Repository containing demographic, academic, and socioeconomic attributes. The proposed stacking architecture combines Random Forest, Gradient Boosting, and XGBoost as base learners with Logistic Regression as a meta-learner, while SMOTE–Tomek Links is employed to address class imbalance and reduce boundary noise. Experimental results show that the model achieves strong predictive performance with 90.91% accuracy and ROC–AUC of 95.72%, demonstrating stable discrimination and outperforming individual base models. Feature importance analysis indicates that early academic trajectory variables—especially first- and second-semester success rates, total approved units, and average grades—are the most influential predictors of dropout risk. The proposed framework contributes a practical, interpretable early warning model by integrating stacking ensemble learning with imbalance handling and trajectory-based feature engineering, supporting actionable intervention planning in higher education.
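For readers who want to see how the described components fit together, the following minimal sketch (not the authors' released code) combines SMOTE–Tomek Links resampling, a stacking ensemble of Random Forest, Gradient Boosting, and XGBoost base learners with a Logistic Regression meta-learner, and accuracy/ROC–AUC evaluation using scikit-learn, imbalanced-learn, and xgboost. The file name students_dropout.csv, the target encoding, and all hyperparameters are illustrative assumptions rather than the paper's exact configuration.

```python
# Illustrative sketch of the pipeline described in the abstract; assumptions
# (file name, target encoding, hyperparameters) are not taken from the paper.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from imblearn.combine import SMOTETomek
from xgboost import XGBClassifier

# Hypothetical loading step: a CSV export of the UCI student dropout dataset,
# with "Target" recoded to a binary label (1 = dropout, 0 = otherwise).
df = pd.read_csv("students_dropout.csv")
X = df.drop(columns=["Target"])
y = (df["Target"] == "Dropout").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Resample only the training split so synthetic samples do not leak
# into the evaluation data.
X_train_res, y_train_res = SMOTETomek(random_state=42).fit_resample(
    X_train, y_train
)

# Stacking architecture: three tree-based base learners feed their
# out-of-fold probability estimates to a Logistic Regression meta-learner.
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
    ("gb", GradientBoostingClassifier(random_state=42)),
    ("xgb", XGBClassifier(eval_metric="logloss", random_state=42)),
]
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train_res, y_train_res)

# Evaluate on the untouched test split.
y_pred = stack.predict(X_test)
y_prob = stack.predict_proba(X_test)[:, 1]
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC:  {roc_auc_score(y_test, y_prob):.4f}")
```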
License
Copyright (c) 2026 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors Declaration
- The Authors certify that they have read, understood, and agreed to the Journal of Information Systems and Informatics (JournalISI) submission guidelines, policies, and submission declaration. The submission has been prepared using the provided template.
- The Authors certify that all authors have approved the publication of this manuscript and that there is no conflict of interest.
- The Authors confirm that the manuscript is their original work, has not been previously published, and is not under consideration for publication elsewhere.
- The Authors confirm that all authors listed on the title page have contributed significantly to the work, have read the manuscript, attest to the validity and legitimacy of the data and its interpretation, and agree to its submission.
- The Authors confirm that the manuscript is not copied or plagiarized from any other published work.
- The Authors declare that the manuscript will not be submitted for publication in any other journal or magazine until a decision is made by the journal editors.
- If the manuscript is finally accepted for publication, the Authors confirm that they will either proceed with publication immediately or withdraw the manuscript in accordance with the journal’s withdrawal policies.
- The Authors agree that, upon publication of the manuscript in this journal, they transfer copyright or assign exclusive rights, including commercial rights, to the publisher.