Abstract:
The problem of optical chemical structure recognition has been tackled by various researchers using both rule-based and machine learning approaches. However, there is still no viable end-to-end pipeline with sufficiently high accuracy. The approaches tried in this research include Transformer-based models as well as image manipulation techniques. The research focuses on applying the attention mechanism of the Transformer architecture, together with transfer learning, to arrive at results with a low Levenshtein distance, a measure of the difference between the actual and predicted labels of chemical images. The label for each image in this study is its InChI (International Chemical Identifier) string.
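For concreteness, the Levenshtein distance between a predicted and a ground-truth InChI string can be computed with the classic dynamic-programming recurrence. The Python sketch below is an illustrative reference implementation, not the study's evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # Classic dynamic programming, keeping only one row of the table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Example: one substituted character gives distance 1.
# levenshtein("InChI=1S/CH4/h1H4", "InChI=1S/CH3/h1H4") -> 1
```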
Several setups were tried, including Vision Transformers combined with vanilla decoders, and an EfficientNetV2 backbone with a Transformer encoder and decoder. The study suggests that EfficientNetV2 coupled with the Transformer architecture produces the best results for the chemical images in the Bristol-Myers Squibb dataset, published electronically in 2021.
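The best-performing setup can be sketched in PyTorch. This is a hypothetical reconstruction assuming only the components named above (a pretrained EfficientNetV2 backbone via timm, plus a Transformer encoder and decoder); the model variant, dimensions, and layer counts are illustrative assumptions, and positional encodings are omitted for brevity:

```python
import torch.nn as nn
import timm

class OCSRModel(nn.Module):
    """Sketch: CNN features -> Transformer encoder -> autoregressive decoder."""
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        # Pretrained EfficientNetV2 backbone (transfer learning); with
        # features_only=True, calling it returns a list of feature maps.
        self.backbone = timm.create_model(
            "tf_efficientnetv2_s", pretrained=True, features_only=True)
        feat_dim = self.backbone.feature_info.channels()[-1]
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        # images: (B, 3, H, W); tokens: (B, T) shifted-right InChI token ids.
        feats = self.backbone(images)[-1]           # (B, C, h, w)
        mem = feats.flatten(2).transpose(1, 2)      # (B, h*w, C) token grid
        mem = self.encoder(self.proj(mem))
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        out = self.decoder(self.embed(tokens), mem, tgt_mask=mask)
        return self.head(out)                       # (B, T, vocab_size) logits
```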
Additionally, resizing with padding instead of stretching produces significantly better results because it prevents information loss. Background and foreground inversion also appears to improve the results.
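Both preprocessing steps are simple to express; the sketch below assumes OpenCV and grayscale inputs, with the target size and padding value chosen for illustration:

```python
import cv2
import numpy as np

def resize_with_padding(img: np.ndarray, size: int = 384) -> np.ndarray:
    """Scale the longer side to `size` and pad the rest, so the aspect
    ratio of the drawing is preserved instead of stretched."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    img = cv2.resize(img, (int(w * scale), int(h * scale)))
    h, w = img.shape[:2]
    top, left = (size - h) // 2, (size - w) // 2
    return cv2.copyMakeBorder(img, top, size - h - top, left, size - w - left,
                              cv2.BORDER_CONSTANT, value=255)  # white padding

def invert(img: np.ndarray) -> np.ndarray:
    """Swap background and foreground: dark strokes on a white page
    become bright strokes on a black background."""
    return 255 - img
```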
Further work is recommended to increase the number of training epochs and to generalize the results to the full dataset rather than the sample used in this study.