Abstract:
Capsule Networks (CapsNets) are a relatively recent deep learning architecture developed to alleviate distinct disadvantages of standard Convolutional Neural Networks (CNNs): their inability to effectively model spatial hierarchies and to account for pose relationships between features. By capturing part-whole relationships and encoding spatial information, CapsNets replace scalar-output neurons with vector-based capsules and use a dynamic routing-by-agreement mechanism. In this thesis we
comprehensively analyse the architectural principles and capabilities of CapsNets, starting with their use in image classification tasks. We present a benchmark analysis of CapsNet performance on a series of datasets, such as MNIST and smallNORB, where they achieve performance comparable and often superior to traditional CNNs with significantly fewer parameters, whilst also demonstrating robustness to affine transformations and object occlusion.
Beyond computer vision, we also extend CapsNets to Natural Language Processing (NLP) and explore their suitability for end-to-end text classification tasks. Specifically, we implement a CapsNet-based text classifier for a sentiment analysis project in Azerbaijani: a morphologically rich and under-resourced language. Using a dataset of around 160,000 user reviews (Hajili's Azerbaijani Review Sentiment Classification dataset), we build a CapsNet-based text classifier and contrast it with baseline CNN and LSTM architectures. With the experiments implemented end-to-end using available Python-based deep learning frameworks, we report that the CapsNet model yields preliminary results that are marginally superior in accuracy and F1 score, along with indications that it can also model some form of semantic hierarchy in language.
In the literature review section we conduct a historical and contemporary survey of CapsNet research, including advances such as Matrix Capsules with EM Routing and more recent routing algorithms that employ attention-based mechanisms. We offer our perspective on the architectural trade-offs, including the computational expense and training instability that hinder large-scale deployment, whilst noting the desirable properties of capsule-inspired representations in both visual and language tasks.
To summarize, we position CapsNets as a recent conceptual architectural innovation that helps bridge the gap between spatial and sequential data processing. With that in mind, we also propose future research directions, in particular coupling CapsNets with Transformer-based models to create hybrid architectures with the potential for even stronger performance on low-resource NLP tasks. Both our empirical results and the architectural concepts we propose should encourage the broader take-up of capsule-based models in cross-domain learning.