The Feature Selection vs Dimensionality Reduction for Steam Game Metadata Classification: An Ensemble Learning Study

Authors

  • Ferdi Setyo Handika, Indonesia
  • Lili Dwi Yulianto, Indonesia
  • Septi Andryana, Indonesia

DOI:

https://doi.org/10.63158/journalisi.v8i1.1456

Keywords:

CatBoost, Dimensionality Reduction, Feature Selection, Binary Classification, Steam Metadata

Abstract

Optimizing noisy Steam game metadata is essential for accurate binary classification. This study compares feature selection (MI) and dimensionality reduction (PCA, LDA) using a dataset of 55,144 Steam reviews and four ensemble algorithms, evaluated through Stratified 5-Fold Cross-Validation. The results show that the 125-feature baseline achieved the highest accuracy of 0.7728 with CatBoost. Feature selection (FS_10) maintained competitive performance with an accuracy of 0.7449, while LDA, after optimization, achieved 0.7281. In contrast, PCA significantly hindered performance (0.6963) due to the inability of linear transformations to preserve the discriminative power of one-hot encoded categorical features, which ensemble models handle better in their original form. These findings highlight the importance of preserving original features, especially in low-to-medium dimensional metadata, where feature selection outperforms dimensionality reduction in maintaining predictive integrity. High accuracy is crucial for developers to track product reception and for platforms to improve recommendation systems that influence user purchasing decisions. The study concludes that for Steam game metadata, strategic feature selection is superior to dimensionality reduction for maintaining classification performance.

References

[1] M. D. Purbolaksono, “Sentiment analysis of game review in Steam platform using Random Forest,” Int. J. Inf. Commun. Technol., vol. 10, no. 2, pp. 161–169, Dec. 2024, doi: 10.21108/ijoict.v10i2.1007.

[2] A. A. Soetasad and E. Fernando, “Comparison of machine learning and deep learning algorithms on sentiment analysis in game reviews,” Int. J. Innov. Res. Sci. Stud., vol. 8, no. 7, pp. 64–71, Oct. 2025, doi: 10.53894/ijirss.v8i7.10401.

[3] Y. Meng, N. Yang, Z. Qian, and G. Zhang, “What makes an online review more helpful: An interpretation framework using XGBoost and SHAP values,” J. Theor. Appl. Electron. Commer. Res., vol. 16, no. 3, pp. 466–490, 2021, doi: 10.3390/jtaer16030029.

[4] R. Egger and J. Yu, “A topic modeling comparison between LDA, NMF, Top2Vec, and BERTopic to demystify Twitter posts,” Front. Sociol., vol. 7, May 2022, doi: 10.3389/fsoc.2022.886498.

[5] J. Varghese and P. T. Selvan, “Feature reduction based LDA with SVM classification on dimensionality reduction for big data,” Int. J. Health Sci. (Qassim), pp. 9415–9431, May 2022, doi: 10.53730/ijhs.v6ns2.7461.

[6] T. Rajendran et al., “Optimizing prediction accuracy in high-dimensional data: Comparative analysis of feature selection methods with Naive Bayes algorithm,” SSRG Int. J. Electron. Commun. Eng., vol. 11, no. 3, pp. 41–52, Mar. 2024, doi: 10.14445/23488549/IJECE-V11I3P105.

[7] M. Buyukkececi and M. C. Okur, “A comprehensive review of feature selection and feature selection stability in machine learning,” Gazi Univ. J. Sci., Dec. 2023, doi: 10.35378/gujs.993763.

[8] F. Anowar, S. Sadaoui, and B. Selim, “Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE),” Comput. Sci. Rev., May 2021, doi: 10.1016/j.cosrev.2021.100378.

[9] S. Feng and H. Wang, “Comparison of PCA and LDA dimensionality reduction algorithms based on wine dataset,” in Proc. 33rd Chinese Control Decision Conf., CCDC 2021, Beijing: IEEE, Nov. 2021, pp. 2791–2796, doi: 10.1109/CCDC52312.2021.9602325.

[10] P. T. Prasetyaningrum, N. Ibrahim, and O. Suria, “Optimizing sentiment analysis of hotel reviews using PCA and machine learning for tourism business decision support,” Indonesian J. Inf. Syst., vol. 8, no. 1, p. 36, Aug. 2025.

[11] E. Ismanto, A. Fadlil, A. Yudhana, and K. Kitagawa, “A comparative study of improved ensemble learning algorithms for patient severity condition classification,” J. Electron. Electromed. Eng. Med. Inform., vol. 6, no. 3, pp. 312–321, Jul. 2024, doi: 10.35882/jeeemi.v6i3.452.

[12] M. A. Hossain and M. S. Islam, “A novel hybrid feature selection and ensemble-based machine learning approach for botnet detection,” Sci. Rep., vol. 13, no. 1, Dec. 2023, doi: 10.1038/s41598-023-48230-1.

[13] S. Mitrović and N. Vrček, “Methodology for the comparative analysis of PCA and autoencoders in dimensionality reduction: Impact on classification accuracy and computational efficiency,” in Proc. Central European Conf. on Inf. and Intell. Syst., Varazdin, Sep. 2025.

[14] M. Lasalvia, V. Capozzi, and G. Perna, “A comparison of PCA-LDA and PLS-DA techniques for classification of vibrational spectra,” Appl. Sci. (Switzerland), vol. 12, no. 11, Jun. 2022, doi: 10.3390/app12115345.

[15] L. Hu, L. Gao, Y. Li, P. Zhang, and W. Gao, “Feature-specific mutual information variation for multi-label feature selection,” Inf. Sci. (N Y), vol. 593, pp. 449–471, May 2022, doi: 10.1016/j.ins.2022.02.024.

[16] T. Agustina, M. Masrizal, and I. Irmayanti, “Performance analysis of random forest algorithm for network anomaly detection using feature selection,” Sinkron, vol. 8, no. 2, Apr. 2024, doi: 10.33395/sinkron.v8i2.13625.

[17] M. Awad and S. Fraihat, “Recursive feature elimination with cross-validation with decision tree: Feature selection method for machine learning-based intrusion detection systems,” J. Sens. Actuator Netw., vol. 12, no. 5, Oct. 2023, doi: 10.3390/jsan12050067.

[18] F. K. Ewald, L. Bothmann, M. N. Wright, B. Bischl, G. Casalicchio, and G. König, “A guide to feature importance methods for scientific inference,” Aug. 2024, doi: 10.1007/978-3-031-63797-1_22.

[19] J. T. Hancock and T. M. Khoshgoftaar, “CatBoost for big data: An interdisciplinary review,” J. Big Data, vol. 7, no. 1, Dec. 2020, doi: 10.1186/s40537-020-00369-8.

[20] A. A. Ibrahim, R. L. Ridwan, M. M. Muhammed, R. O. Abdulaziz, and G. A. Saheed, “Comparison of the CatBoost classifier with other machine learning methods,” Int. J. Adv. Comput. Sci. Appl., vol. 11, no. 11, 2020.

[21] J. Yan et al., “LightGBM: Accelerated genomically designed crop breeding through ensemble learning,” Genome Biol., vol. 22, no. 1, Dec. 2021, doi: 10.1186/s13059-021-02492-y.

[22] A. Joshi, P. Saggar, R. Jain, M. Sharma, D. Gupta, and A. Khanna, “CatBoost — An ensemble machine learning model for prediction and classification of student academic performance,” Adv. Data Sci. Adapt. Anal., vol. 13, no. 03n04, Jul. 2021, doi: 10.1142/s2424922x21410023.

[23] H. Wang, Q. Liang, J. T. Hancock, and T. M. Khoshgoftaar, “Feature selection strategies: A comparative analysis of SHAP-value and importance-based methods,” J. Big Data, vol. 11, no. 1, Dec. 2024, doi: 10.1186/s40537-024-00905-w.

[24] W. Liang, S. Luo, G. Zhao, and H. Wu, “Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms,” Math., vol. 8, no. 5, May 2020, doi: 10.3390/MATH8050765.

[25] F. Madani and A. H. Lubis, “CatBoost algorithm implementation for classifying women’s fashion products,” J. Inf. Telecommun. Eng., vol. 9, no. 1, 2025, doi: 10.31289/jite.v9i1.15604.

[26] P. P. Putra, M. K. Anam, A. S. Chan, A. Hadi, N. Hendri, and A. Masnur, “Optimizing sentiment analysis on imbalanced hotel review data using SMOTE and ensemble machine learning techniques,” J. Appl. Data Sci., vol. 6, no. 2, pp. 936–951, May 2025, doi: 10.47738/jads.v6i2.618.

[27] Y. Wang, Z. Wu, J. Gao, C. Liu, and F. Guo, “A multi-level classification-based ensemble and feature extractor for credit risk assessment,” PeerJ Comput. Sci., vol. 10, 2024, doi: 10.7717/peerj-cs.1915.

[28] J. Gan and Y. Qi, “Selection of the optimal number of topics for LDA topic model—Taking patent policy analysis as an example,” Entropy, vol. 23, no. 10, Oct. 2021, doi: 10.3390/e23101301.

[29] F. U. Shah, A. U. Khan, A. W. Khan, B. Ullah, M. R. Khan, and I. Javed, “Comparative analysis of ensemble learning algorithms in water quality prediction,” J. Hydroinform., vol. 26, no. 12, pp. 3041–3059, Dec. 2024, doi: 10.2166/hydro.2024.071.

[30] J. Hu and S. Szymczak, “A review on longitudinal data analysis with random forest,” Brief. Bioinform., vol. 24, no. 2, Mar. 2023, doi: 10.1093/bib/bbad002.

[31] S. Widodo, H. Brawijaya, and S. Samudi, “Stratified K-fold cross-validation optimization on machine learning for prediction,” Sinkron, vol. 7, no. 4, pp. 2407–2414, Oct. 2022, doi: 10.33395/sinkron.v7i4.11792.

[32] A. I. Adler and A. Painsky, “Feature importance in gradient boosting trees with cross-validation feature selection,” Entropy, vol. 24, no. 5, May 2022, doi: 10.3390/e24050687.

Published

2026-03-03

Issue

Section

Articles

How to Cite

[1]
F. S. Handika, L. D. Yulianto, and S. Andryana, “The Feature Selection vs Dimensionality Reduction for Steam Game Metadata Classification: An Ensemble Learning Study”, journalisi, vol. 8, no. 1, pp. 928–954, Mar. 2026, doi: 10.63158/journalisi.v8i1.1456.