Semantic-Enhanced News Clustering Using TF-IDF and WordNet with K-Means
Abstract
Clustering news articles is an unsupervised learning task in which models operate on unlabeled or only partially annotated data. K-Means remains one of the most commonly applied clustering algorithms because of its efficiency and simplicity, and TF-IDF is a widely used approach for building document feature matrices through statistical term weighting. TF-IDF, however, cannot represent contextual meaning, so semantically related news articles that use different wording often fail to form coherent clusters. The baseline experiment illustrates this limitation: TF-IDF alone obtained a silhouette score of 0.011 at its optimal cluster configuration (k = 5). To address this, the study introduces semantic enrichment with WordNet to improve the similarity representation of keywords extracted through TF-IDF, evaluated on 1,000 documents sampled from 21,495 filtered records. The elbow method was used to determine the optimal number of clusters. At the optimal k of 3, the proposed method achieved a silhouette score of 0.505, substantially higher than the baseline TF-IDF representation despite using fewer clusters. These results indicate that incorporating semantic information can enhance statistical text representations and produce more contextually coherent news clusters. To manage computational cost, the model applies a first-POS strategy in which only the first synset derived from POS tagging is considered. While this reduces processing complexity, it may limit the model's ability to capture polysemy.
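The effect described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the `FIRST_SYNSET` table is a hypothetical stand-in for a WordNet first-synset lookup, the concept labels and the four toy documents are invented for illustration, the TF-IDF weighting uses one common smoothed variant, and cluster labels are fixed by hand rather than produced by K-Means. Under those assumptions, it shows how mapping synonyms to a shared concept before weighting can raise both pairwise similarity and the silhouette score.

```python
import math
from collections import Counter

# ASSUMPTION: toy stand-in for a WordNet first-synset lookup. A real pipeline
# would query WordNet after POS tagging; these concept labels are invented.
FIRST_SYNSET = {
    "automobile": "CONCEPT_CAR", "car": "CONCEPT_CAR",
    "purchase": "CONCEPT_BUY", "buy": "CONCEPT_BUY",
    "squad": "CONCEPT_TEAM", "team": "CONCEPT_TEAM",
    "match": "CONCEPT_GAME", "game": "CONCEPT_GAME",
}

def enrich(tokens):
    """Replace each token by its single (first) concept, if one is known."""
    return [FIRST_SYNSET.get(t, t) for t in tokens]

def tfidf(docs):
    """Sparse TF-IDF vectors (dicts) using a smoothed IDF variant."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def silhouette(vectors, labels):
    """Mean silhouette coefficient with cosine distance (1 - similarity)."""
    scores = []
    for i, vi in enumerate(vectors):
        by_cluster = {}
        for j, vj in enumerate(vectors):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(1.0 - cosine(vi, vj))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # convention for singletons or a single cluster
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Invented toy corpus: two topics, each phrased with different synonyms.
docs = [
    ["minister", "purchase", "automobile"],
    ["minister", "buy", "car"],
    ["team", "win", "match"],
    ["squad", "win", "game"],
]
labels = [0, 0, 1, 1]  # hand-assigned clusters; K-Means is omitted here

raw_vecs = tfidf(docs)
enriched_vecs = tfidf([enrich(d) for d in docs])

raw_sim = cosine(raw_vecs[0], raw_vecs[1])
enriched_sim = cosine(enriched_vecs[0], enriched_vecs[1])

print("similarity of docs 0 and 1:", raw_sim, "->", enriched_sim)
print("silhouette:", silhouette(raw_vecs, labels), "->",
      silhouette(enriched_vecs, labels))
```

Because each word maps to exactly one concept, the sketch also mirrors the stated limitation: a polysemous word would always be assigned the same concept regardless of its context.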
Copyright (c) 2025 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.