Abstract:
The year 2020 is mostly associated with the outbreak of COVID-19, the disease caused by the SARS-CoV-2 coronavirus, because of its immeasurable effects on our lives. Humanity faced challenges unprecedented in recent history. A great deal of research was carried out to find ways to combat COVID-19 and save as many lives as possible. This led to a huge number of articles and research papers on COVID-19, a body of literature that was hard to keep up with. Several datasets, such as LitCovid and CORD-19, were created to store collections of COVID-19-related literature. Gaining insights from such datasets requires data analytics and machine learning techniques.
This Master's thesis presents a comprehensive analysis of text classification, topic modelling, and clustering methods, including Support Vector Machines (SVM), Naive Bayes, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-negative Matrix Factorization (NMF), BERT, BioBERT, and SciBERT, applied to a large dataset of COVID-19 research articles sourced from the LitCovid database. The primary goal of this research is to devise and assess techniques for organizing, analyzing, and understanding the rapidly expanding collection of scientific literature on COVID-19. The research is structured into multiple phases. First, a thorough literature review is conducted to establish a solid understanding of the state of the art in NLP, text classification, clustering, and topic modelling. This review covers traditional machine learning techniques, including supervised and unsupervised algorithms. Their applications to different datasets, including COVID-19-related datasets such as CORD-19 and LitCovid, are also discussed.
Next, a description of the LitCovid dataset is provided. Afterwards, the machine learning techniques mentioned above are applied with different word vectorization techniques, including Bag-of-Words, TF-IDF, and Word2Vec, to identify how each algorithm behaves with each vectorization method.
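To illustrate how two of the vectorization techniques named above differ, the following minimal sketch builds Bag-of-Words and TF-IDF vectors for a toy corpus. It is not the thesis's implementation: the three sentences are invented for illustration and are not taken from LitCovid, the function names are hypothetical, and the unsmoothed formula idf = log(N / df) is an assumption (libraries such as scikit-learn use a smoothed variant by default).

```python
import math
from collections import Counter

# Toy corpus: invented sentences for illustration, NOT taken from LitCovid.
docs = [
    "vaccine trial shows immune response",
    "vaccine distribution and public health policy",
    "public health policy during lockdown",
]

# Whitespace tokenization keeps the sketch short; a real pipeline would
# also lowercase, strip punctuation, and remove stop words.
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})


def bag_of_words(tokens):
    """Raw term counts, one entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tf_idf(tokens):
    """Term frequency times inverse document frequency.

    Assumes the plain, unsmoothed definition idf = log(N / df).
    """
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for term in vocab:
        tf = counts[term] / len(tokens)
        df = sum(1 for doc in tokenized if term in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector


bow = [bag_of_words(doc) for doc in tokenized]
tfidf = [tf_idf(doc) for doc in tokenized]
```

In the first toy document, Bag-of-Words gives "vaccine" and "trial" the same count, whereas TF-IDF gives "trial" the higher weight because it occurs in only one document. This reweighting of common versus distinctive terms is one reason a classifier or clustering algorithm may behave differently depending on the vectorization it is given.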
In the results and analysis section, the author offers a comprehensive comparison of all classification, topic modelling, and clustering approaches applied to COVID-19 research articles. Finally, in the summary and future work section, the author consolidates the key findings and considers directions for future research.