Abstract
This research develops a hybrid similarity measure for document clustering that combines Cosine Similarity, Minkowski Distance, BM25 Similarity, and the Dice Coefficient. Most previous document clustering approaches rely on measures of lexical co-occurrence, which limits their capacity to detect latent semantic connections between documents. This limitation reduces clustering accuracy, especially when the texts are heterogeneous and ambiguous. In response to these challenges, this research introduces a fused similarity measure that incorporates both lexical and semantic similarity to improve clustering coherence.
The work begins with an overview of semantic similarity assessment techniques, discussing their advantages and drawbacks. It then presents the hybrid similarity formula and describes how the formula's performance is improved through weight-tuning mechanisms, including Grid Search, Genetic Algorithms, and Particle Swarm Optimization. To validate the new similarity measure and compare its results with existing ones, the measure is tested in conjunction with the clustering algorithms K-means, DBSCAN, and Agglomerative Hierarchical Clustering.
The experimental results indicate that the hybrid similarity measure performs better than traditional metrics at grouping documents according to their true semantic content, even when the documents have complicated structures and come from different contexts. The results suggest that combining multiple similarity measures can improve the quality of document clustering and thereby support better navigation, access, and analysis of document collections. Future research could examine more advanced similarity measures that incorporate contextual embeddings and domain knowledge, which is a promising direction for improving document clustering in various applications and markets.
Keywords: Document Clustering, Semantic Similarity, Hybrid Similarity Measure, Text Mining, Optimization Techniques.
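
As a rough illustration of the idea summarized in the abstract, the following Python sketch combines the four named measures into one weighted score. The abstract does not state the actual formula, so everything here is an assumption made for illustration: the linear weighted sum, the equal default weights, the conversion of Minkowski distance to a similarity via 1 / (1 + d), the normalization of BM25 by the query's self-score, and all function names. In the approach described above, the weights would be tuned (for example by Grid Search) rather than fixed.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def cosine_sim(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def minkowski_sim(a, b, p=2):
    """Minkowski distance of order p, mapped to (0, 1] as 1 / (1 + d).

    The mapping is an illustrative choice, not taken from the source.
    """
    ca, cb = Counter(a), Counter(b)
    vocab = set(ca) | set(cb)
    dist = sum(abs(ca[t] - cb[t]) ** p for t in vocab) ** (1 / p)
    return 1.0 / (1.0 + dist)


def dice_sim(a, b):
    """Dice coefficient on the sets of distinct tokens."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of doc for query; corpus must hold non-empty docs."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf[term] * (k1 + 1) / denom
    return score


def bm25_sim(a, b, corpus):
    """BM25 of b for query a, scaled by a's self-score and capped at 1.

    This normalization is an assumption, and the result is not symmetric.
    """
    self_score = bm25_score(a, a, corpus)
    if self_score <= 0:
        return 0.0
    return min(1.0, bm25_score(a, b, corpus) / self_score)


def hybrid_sim(a, b, corpus, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four component similarities (hypothetical form)."""
    parts = (
        cosine_sim(a, b),
        minkowski_sim(a, b),
        bm25_sim(a, b, corpus),
        dice_sim(a, b),
    )
    return sum(w * s for w, s in zip(weights, parts))


# Usage: score two document pairs from a tiny made-up corpus.
corpus = [
    tokenize("document clustering groups similar documents"),
    tokenize("clustering groups similar text documents"),
    tokenize("genetic algorithms tune weights"),
]
related = hybrid_sim(corpus[0], corpus[1], corpus)
unrelated = hybrid_sim(corpus[0], corpus[2], corpus)
```

With weights that sum to 1, each component lies in [0, 1], so the combined score does as well. A pairwise matrix of such scores could then be passed to a clustering algorithm that accepts precomputed similarities or distances.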