Abstract:
Researching and developing new chemical products requires the information contained in
printed or digitized chemical structures. Few automatic systems for recognizing and
translating structural formulas currently exist in industry, so identifying and analyzing
them takes considerable manual effort. Many older printed publications remain on paper
because of the effort required to translate them accurately into a computer-friendly format.
Machine learning solutions are being developed to address this problem, yet issues such as
variations in drawing style and low image quality are not well accounted for in existing
research. This leads to unstable model predictions on data from older sources. Hence, the
purpose of this study is to develop a chemical structure recognition model that is robust
to low image quality and agnostic to position. The study develops and analyzes several
feature extraction methods, custom augmentations that preserve correct textual orientation,
and applications of random noise, all within a pipeline that translates images to sequences
using LSTM networks with attention. The results show that the InceptionV3 extraction method
performs significantly
better than autoencoders, owing to its depth and its filters at several different scales.
The baseline image-to-sequence model achieves a minimum Levenshtein score of about 19
characters on the validation set, which corresponds to an error rate of approximately 10%.
Custom augmentations and reduced image quality do not significantly affect the score, which
may be due to text ordering, the random placement of noise, and model overfitting on the
original dataset.
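
The Levenshtein score reported above counts the single-character insertions, deletions, and
substitutions needed to turn a predicted sequence into the reference sequence. The following
is a minimal sketch of that metric, not the study's own evaluation code. The function names
`levenshtein` and `error_rate` are hypothetical, and normalizing by the reference length is
an assumption, since the abstract does not state how the error rate was derived.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(predicted: str, reference: str) -> float:
    """Edit distance divided by reference length (assumed normalization)."""
    if not reference:
        raise ValueError("reference sequence must not be empty")
    return levenshtein(predicted, reference) / len(reference)
```

Under this assumed normalization, a distance of 19 characters and an error rate of about 10%
would correspond to reference sequences of roughly 190 characters.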