Large Scale Classification and Clusterization of COVID-19 Related Papers

Talibzade, Rustam

dc.contributor.author	Talibzade, Rustam
dc.date.accessioned	2024-12-19T23:36:58Z
dc.date.available	2024-12-19T23:36:58Z
dc.date.issued	2023-04
dc.identifier.uri	http://hdl.handle.net/20.500.12181/928
dc.description.abstract	The year 2020 is mostly associated with the outbreak of COVID-19, caused by the SARS-CoV-2 coronavirus, because of its immeasurable effects on our lives. Humanity faced challenges unprecedented in recent history. A great deal of research was carried out to find ways to combat COVID-19 and save as many lives as possible. This produced a huge number of articles and research papers on COVID-19, which became hard to keep up with. Several datasets, such as LitCovid and CORD-19, were created to collect COVID-19 related literature. Gaining insight from such datasets requires data analytics and machine learning techniques. This Master's thesis presents a comprehensive analysis of text classification and clustering methodologies, including Support Vector Machines (SVM), Naive Bayes, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-negative Matrix Factorization (NMF), BERT, BioBERT, and SciBERT, applied to a large dataset of COVID-19 research articles sourced from the LitCovid database. The primary goal of this research is to devise and assess techniques for organizing, analyzing, and understanding the swiftly expanding collection of scientific literature pertaining to COVID-19. The research is structured into multiple phases. First, a thorough literature review establishes an understanding of the state of the art in NLP, text classification, clustering, and topic modelling. This review covers traditional machine learning techniques, including supervised and unsupervised clustering algorithms. Their applications to different datasets, including COVID-19 related datasets such as CORD-19 and LitCovid, are also discussed. Next, a description of the LitCovid dataset is provided.
Afterwards, the machine learning techniques mentioned above are applied with different word vectorization techniques, including Bag-of-Words, TF-IDF, and Word2Vec, to identify how each algorithm behaves with these vectorization methods. In the results and analysis section, the author offers a comprehensive comparison of all classification, topic modelling, and clustering approaches used for COVID-19 research articles. Finally, in the summary and future work section, the author consolidates key findings and considers directions for future research.	en_US
dc.language.iso	en	en_US
dc.publisher	ADA University	en_US
dc.relation	School of IT and Engineering	en_US
dc.relation	Graduate program	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
dc.subject	COVID-19 (Disease) -- Research	en_US
dc.subject	Natural language processing (Computer science)	en_US
dc.subject	Machine learning -- Medical applications	en_US
dc.subject	IT and Engineering	en_US
dc.title	Large Scale Classification and Clusterization of COVID-19 Related Papers	en_US
dc.type	Thesis	en_US