Feature Selection vs Dimensionality Reduction for Steam Game Metadata Classification: An Ensemble Learning Study
DOI: https://doi.org/10.63158/journalisi.v8i1.1456
Keywords: CatBoost, Dimensionality Reduction, Feature Selection, Binary Classification, Steam Metadata
Abstract
Optimizing noisy Steam game metadata is essential for accurate binary classification. This study compares feature selection (MI) and dimensionality reduction (PCA, LDA) using a dataset of 55,144 Steam reviews and four ensemble algorithms, evaluated through Stratified 5-Fold Cross-Validation. The results show that the 125-feature baseline achieved the highest accuracy of 0.7728 with CatBoost. Feature selection (FS_10) maintained competitive performance with an accuracy of 0.7449, while LDA, after optimization, achieved 0.7281. In contrast, PCA significantly hindered performance (0.6963) due to the inability of linear transformations to preserve the discriminative power of one-hot encoded categorical features, which ensemble models handle better in their original form. These findings highlight the importance of preserving original features, especially in low-to-medium dimensional metadata, where feature selection outperforms dimensionality reduction in maintaining predictive integrity. High accuracy is crucial for developers to track product reception and for platforms to improve recommendation systems that influence user purchasing decisions. The study concludes that for Steam game metadata, strategic feature selection is superior to dimensionality reduction for maintaining classification performance.
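The comparison described above can be sketched as follows. This is a minimal illustration, not the paper's code: the data is synthetic, a RandomForest stands in for CatBoost, and the feature counts (125 baseline, 10 selected/projected) merely mirror the configurations named in the abstract. It contrasts mutual-information feature selection against PCA and LDA projections, each evaluated with stratified 5-fold cross-validation.

```python
# Sketch of the evaluation protocol: baseline vs MI feature selection vs
# PCA/LDA, each scored with Stratified 5-Fold CV. Synthetic stand-in data;
# RandomForest substitutes for the ensemble models used in the study.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic binary-classification data with 125 features, echoing the
# 125-feature baseline (the real study uses 55,144 Steam records).
X, y = make_classification(n_samples=1000, n_features=125,
                           n_informative=15, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

pipelines = {
    "baseline_125": make_pipeline(model),
    # FS_10: keep the 10 features with highest mutual information.
    "fs_mi_10": make_pipeline(
        SelectKBest(mutual_info_classif, k=10), model),
    # PCA: project onto 10 linear components (variance-driven, label-blind).
    "pca_10": make_pipeline(PCA(n_components=10), model),
    # LDA: at most n_classes - 1 = 1 component for a binary target.
    "lda_1": make_pipeline(LinearDiscriminantAnalysis(n_components=1), model),
}

for name, pipe in pipelines.items():
    acc = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean()
    print(f"{name}: {acc:.4f}")
```

On data with one-hot encoded categoricals, such a sketch typically reproduces the abstract's qualitative ordering (original features ≥ selected features > linear projections), since PCA's variance-maximizing rotation mixes the sparse binary columns that tree ensembles split on directly.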
License
Copyright (c) 2026 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors Declaration
- The Authors certify that they have read, understood, and agreed to the Journal of Information Systems and Informatics (JournalISI) submission guidelines, policies, and submission declaration. The submission has been prepared using the provided template.
- The Authors certify that all authors have approved the publication of this manuscript and that there is no conflict of interest.
- The Authors confirm that the manuscript is their original work, has not received prior publication, is not under consideration for publication elsewhere, and has not been previously published.
- The Authors confirm that all authors listed on the title page have contributed significantly to the work, have read the manuscript, attest to the validity and legitimacy of the data and its interpretation, and agree to its submission.
- The Authors confirm that the manuscript is not copied from or plagiarized from any other published work.
- The Authors declare that the manuscript will not be submitted for publication in any other journal or magazine until a decision is made by the journal editors.
- If the manuscript is finally accepted for publication, the Authors confirm that they will either proceed with publication immediately or withdraw the manuscript in accordance with the journal’s withdrawal policies.
- The Authors agree that, upon publication of the manuscript in this journal, they transfer copyright or assign exclusive rights to the publisher, including commercial rights.