Abstract:
As technology improves day by day, access to different resources becomes easier. Today, several technologies aim to solve problems faced by disabled people. One remaining obstacle is that people from deaf/mute communities have difficulty establishing healthy communication with others, especially with those outside their community. The technology that addresses such problems is known as a Sign Language Recognition (SLR) system.
There are approximately 50,000 deaf/mute people in Azerbaijan. They have their own sign language, called Azerbaijani Sign Language (AzSL). Apart from sign language interpreters, people outside deaf/mute communities generally do not know AzSL. AzSL has 32 letters. 24 of them are static: a letter is expressed by holding the hand in a specific shape and orientation, so it can be captured in a single frame. The remaining 8 letters are dynamic: like static letters they use hand shapes, but the hands must also move, for example up and down or in rotation. Such letters cannot be captured in a single frame; they consist of a sequence of frames, similar to a video. Besides these 8 letters, all words are also dynamic. Our goal in this paper is to build an SLR system that reads video from a live camera and converts it into text in real time.
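In data terms, the distinction is simply between a single frame and an ordered stack of frames. The following is a minimal sketch of the two representations; all dimensions are illustrative assumptions, not values from the paper.

```python
import numpy as np

# One static letter: a single RGB frame (height, width, channels illustrative).
static_sample = np.zeros((480, 640, 3), dtype=np.uint8)     # shape: (H, W, 3)

# One dynamic letter: an ordered sequence of T such frames, like a short video.
T = 30                                                      # frame count is an assumption
dynamic_sample = np.zeros((T, 480, 640, 3), dtype=np.uint8) # shape: (T, H, W, 3)
```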
AzSL differs from other well-known sign languages such as American, German, French, and Russian. Consequently, no existing dataset contains the letters and words of AzSL. Therefore, the first task was to collect a dataset of sufficient quality and quantity. For that purpose, we created a Telegram bot through which volunteer users, mostly students of ADA University, could capture pictures (for static letters) and videos (for dynamic letters) according to the samples provided and upload them to our servers. In total, approximately 14,000 pictures and 3,000 videos were collected. Data for words, which are all dynamic, is still being collected for further research and applications. In this paper, the scope is limited to developing a recognition system for static letters only.
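The paper does not describe the bot's implementation; the sketch below only illustrates how such a collection bot could look, assuming the python-telegram-bot library (v13-style synchronous API). The token placeholder and storage paths are hypothetical.

```python
# Minimal sketch of a data-collection bot, assuming python-telegram-bot v13.
# "BOT_TOKEN" and the directory names are hypothetical placeholders.
from telegram.ext import Updater, MessageHandler, Filters

def save_photo(update, context):
    # Highest-resolution version of an uploaded picture (static letters).
    tg_file = update.message.photo[-1].get_file()
    tg_file.download(f"static_letters/{tg_file.file_id}.jpg")

def save_video(update, context):
    # Uploaded video clips correspond to dynamic letters.
    tg_file = update.message.video.get_file()
    tg_file.download(f"dynamic_letters/{tg_file.file_id}.mp4")

updater = Updater("BOT_TOKEN")
updater.dispatcher.add_handler(MessageHandler(Filters.photo, save_photo))
updater.dispatcher.add_handler(MessageHandler(Filters.video, save_video))
updater.start_polling()
updater.idle()
```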
After reviewing a sufficient number of related papers, we found that using MediaPipe for feature extraction is the best option for our dataset. MediaPipe is an open-source framework that extracts important landmarks from human body parts. In our project we use only hand landmarks, since body pose and facial expression have no effect in AzSL. MediaPipe extracts 21 hand joints for a single hand, each with 3 coordinates, so the input size is 63 (21×3) for one hand, or 126 (2×21×3) when both hands are present.
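As an illustration, this feature extraction step could look like the following minimal sketch using MediaPipe's Python Hands solution; the file name is hypothetical, and this is not the paper's exact code.

```python
import cv2
import mediapipe as mp
import numpy as np

# Static-image mode, up to two hands, matching the 63/126 feature sizes above.
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

def landmark_features(image_path):
    """Flatten the (x, y, z) coordinates of the 21 joints of each detected hand."""
    bgr = cv2.imread(image_path)
    result = hands.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None                              # no hand detected in this frame
    features = []
    for hand in result.multi_hand_landmarks:     # one entry per detected hand
        for lm in hand.landmark:                 # 21 joints per hand
            features.extend([lm.x, lm.y, lm.z])
    return np.asarray(features)                  # length 63 (one hand) or 126 (both)

vec = landmark_features("sample_static_letter.jpg")  # hypothetical file name
```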
Another approach was to train a Convolutional Neural Network (CNN) on the raw images with various parameter settings. However, because of the small number of samples and limited computational power, none of the CNN experiments reached the desired level of performance. Returning to the MediaPipe features, we trained them with different classifiers, including Logistic Regression, Multilayer Perceptrons, Deep Neural Networks (DNNs), and others. Some letters are similar enough that a single model cannot generalize over them well. For this reason, a two-level DNN architecture was designed in which similar letters are grouped and trained separately; this can also be viewed as a form of clustering. This architecture gave the best result, with 94% test accuracy.
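The paper's exact architecture is not reproduced here; the sketch below, assuming Keras, only illustrates the two-level idea: a first-level DNN assigns the landmark vector to a group of similar letters, and a dedicated second-level DNN then picks the letter within that group. Layer sizes, the number of groups, and the letters per group are assumptions.

```python
import numpy as np
from tensorflow import keras

def make_dnn(num_classes, input_dim=63):
    # Small fully connected classifier over the landmark feature vector;
    # layer widths are illustrative, not the paper's values.
    return keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])

# Level 1: map the landmark vector to one of N groups of similar letters.
group_model = make_dnn(num_classes=5)                  # 5 groups is an assumption
# Level 2: one small DNN per group, covering only that group's letters.
letter_models = {g: make_dnn(num_classes=6) for g in range(5)}

def predict_letter(features):
    x = features.reshape(1, -1)
    group = int(np.argmax(group_model.predict(x, verbose=0)))
    letter = int(np.argmax(letter_models[group].predict(x, verbose=0)))
    return group, letter
```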