Could It Be the Patches? This AI Approach Analyzes the Key Contributor to the Success of Vision Transformers
Convolutional neural networks (CNNs) have been the backbone of systems for computer vision tasks. They have been the go-to architecture for all types of problems, from object detection to image super-resolution. In fact, the famous leaps (e.g., AlexNet) in the deep learning domain have been made possible thanks to convolutional neural networks. However, things changed when a new architecture based on Transformer models, called the Vision Transformer (ViT), showed promising results and outperformed classical convolutional architectures, especially for large data sets. Since then, the field has been looking to enable ViT-based solutions for problems that have been tackled with CNNs
marktechpost.com