Abstract:
This paper presents the creation of a large-scale Azerbaijani language corpus of more than 50 million tokens and the development of several functionalities for language analysis and corpus linguistics: Word Frequency, Ngrams, Concordance, Thesaurus, and Word Sketch. The corpus was collected from various sources, including Azerbaijani books, articles, and websites, and is stored in a relational database. The paper describes the corpus creation process and the database schema in detail, then explains how each functionality was built and what insights it can provide. It then reviews existing corpus applications, examining their interfaces and user experience, before introducing an online application that makes the Azerbaijani corpus and its functionalities available to linguists, researchers, and language learners. The functionalities were implemented in Python, and the user interface was built with Next.js. The final product is a web application that gives users easy access to all the functionalities of the corpus.