Vision Transformers, or ViTs, are a type of deep learning model that uses self-attention mechanisms to process visual data.
ViTs were introduced by Dosovitskiy et al. in the 2020 paper "An Image is Worth 16x16 Words" and have since gained popularity in computer vision research.
In a Vision Transformer, an image is first divided into fixed-size patches, which are flattened, linearly projected into patch embeddings (with position embeddings added to preserve spatial order), and fed into a multi-layer transformer encoder. The multi-head self-attention mechanism in each transformer layer lets the model weigh relationships between image patches at differing levels of abstraction, capturing both local and global features.
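The patch-embedding step is easiest to see in code. The following PyTorch sketch (illustrative settings only: 224x224 RGB input, 16x16 patches, 768-dimensional embeddings, as in the ViT-Base configuration) shows how an image becomes a sequence of patch tokens; a strided convolution performs the "flatten each patch and linearly project it" step in one operation.

```python
import torch
import torch.nn as nn

# Illustrative hyperparameters (ViT-Base-like); any consistent choice works.
image_size, patch_size, embed_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patches

# A convolution with kernel = stride = patch size splits the image into
# non-overlapping patches and linearly projects each one to embed_dim.
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

x = torch.randn(1, 3, image_size, image_size)   # (batch, channels, H, W)
tokens = patch_embed(x)                         # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)      # (1, 196, 768): sequence of patch tokens

# Learned position embeddings preserve spatial order for the transformer layers.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens = tokens + pos_embed
print(tokens.shape)  # torch.Size([1, 196, 768])
```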
The transformer's output, typically the representation of a special classification token, is passed through a final classification head to obtain the predicted class label.
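A minimal sketch of the remaining pipeline, continuing from the patch tokens above: a learnable classification token is prepended to the sequence, the tokens run through standard transformer encoder layers, and the classification token's output feeds a linear head. The layer counts, head count, and number of classes here are illustrative assumptions, not a definitive implementation.

```python
import torch
import torch.nn as nn

embed_dim, num_patches, num_classes = 768, 196, 1000  # illustrative values

# Learnable classification token prepended to the patch sequence.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Stack of pre-norm transformer encoder layers with multi-head self-attention.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=12, dim_feedforward=3072,
    batch_first=True, norm_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)
head = nn.Linear(embed_dim, num_classes)

tokens = torch.randn(1, num_patches, embed_dim)                    # patch tokens from the previous step
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1)  # (1, 197, 768)
encoded = encoder(tokens)                                          # self-attention over all patches
logits = head(encoded[:, 0])                                       # classify from the [CLS] token
print(logits.shape)  # torch.Size([1, 1000])
```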