Abstract



Higher education institutions are regarded as indicators of countries' economic and social development. Student dropout or failure, whatever its cause, not only damages the reputation of these institutions but also poses significant problems for students, their families, and society in general. Identifying students at risk of dropping out in advance is therefore considered crucial. The primary aim of this study is to predict students at risk of dropping out of higher education institutions using the Random Forest method, one of the techniques used in educational data mining. Its secondary aim is to compare the classification performance of the method across sample sizes. For this purpose, a dataset from the Kaggle database, created to help reduce academic failure and dropout rates in higher education, was used. The dataset contains students' enrollment information along with their demographic and socioeconomic status. It comprises 4424 samples and 37 variables, one of which is the dependent variable. Random samples of 500, 1000, 2000, 3000, and 4000 were drawn from the dataset. Analyses were conducted using an open-source Python-based program. The AUC, accuracy, F1, precision, and recall metrics were used to measure the classification performance of the method. The performance metrics obtained were AUC: 0.961, accuracy: 0.881, F1: 0.878, precision: 0.879, and recall: 0.881. The results indicate that the method performs best at a sample size of 4000 and that classification success increases with sample size. The most important variable across all sample sizes is "Curricular units 2nd sem (approved)". Future studies are recommended to examine the performance of different data mining methods in predicting student dropout and failure under varying conditions.
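As an illustration of the reported metrics, the sketch below computes accuracy together with support-weighted precision, recall, and F1 from a set of predictions. The reported recall (0.881) equals the reported accuracy (0.881), which is consistent with support-weighted averaging over classes, because weighted recall is always identical to accuracy. Whether the study's software averaged in this way is an assumption, not something stated in the abstract. The function name `weighted_metrics` and the example labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall, and F1.

    Assumption: per-class scores are averaged with weights equal to each
    class's share of the true labels (support-weighted averaging).
    """
    n = len(y_true)
    support = Counter(y_true)      # number of true cases per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / n_true
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n_true / n        # class share of the sample
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels for illustration only (not the study's data).
y_true = ["dropout"] * 4 + ["retained"] * 6
y_pred = ["dropout", "dropout", "retained", "retained"] + ["retained"] * 6

metrics = weighted_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy example the weighted recall matches the accuracy while precision and F1 differ from it, the same pattern seen in the reported results.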



Keywords

Dropout, higher education, sample size, random forest, educational data mining




