Abstract:
In the digital era, the rapid growth of textual data demands sophisticated text mining
and clustering methods. Although the state of the art has advanced for most well-resourced
languages, relatively little research has been carried out on low-resource languages such as
Azerbaijani. This thesis investigates the use of clustering algorithms to improve access to
information and communication in the Azerbaijani-speaking community.
A corpus of 15,500 news articles collected from oxu.az was used. K-means, Fuzzy K-means,
Agglomerative Hierarchical Clustering, Spectral Clustering, Gaussian Mixture Models (GMM),
and Latent Dirichlet Allocation (LDA) were applied. They were evaluated using the
Silhouette Score (SS) and the Davies-Bouldin Index. Word2Vec embeddings yielded a higher
Adjusted Rand Index (ARI) than TF-IDF, while Spectral Clustering and LDA achieved the best
scores, owing to their ability to capture complex structure in the data.
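
As an illustration of the two internal metrics named above, the following is a minimal sketch in plain Python, not the code used in the thesis. The toy 2-D points, the two labelings, and all function names are assumptions made for this example; in practice a library such as scikit-learn would be applied to the document vectors. A higher Silhouette Score and a lower Davies-Bouldin Index both indicate compact, well-separated clusters.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_label(points, labels):
    """Map each cluster label to the list of points assigned to it."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette_score(points, labels):
    """Mean silhouette over all points (range -1 to 1, higher is better)."""
    clusters = group_by_label(points, labels)
    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean worst-case cluster similarity (>= 0, lower is better)."""
    clusters = group_by_label(points, labels)
    centroids = {
        key: [sum(col) / len(members) for col in zip(*members)]
        for key, members in clusters.items()
    }
    # scatter: mean distance from a cluster's points to its centroid
    scatter = {
        key: sum(euclidean(p, centroids[key]) for p in members) / len(members)
        for key, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

for name, labels in (("good", good_labels), ("bad", bad_labels)):
    print(name, silhouette_score(points, labels), davies_bouldin_index(points, labels))
```

Both functions assume at least two clusters with distinct centroids. Unlike ARI, neither metric requires ground-truth category labels.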
Future work will focus on improved pre-processing, hybrid clustering, and deep learning
embeddings. Applications to real-world problems, ranging from recommendation systems to
content categorization, will provide practical experience with the models.