Get a Clearer Picture of Vision Transformers’ Potential at the Edge

— Scenario: Corporate security staff get an alert that a video camera has detected a former employee entering an off-limits building.

— Scenario: A radiologist receives a flag that an MRI contains early markers for potentially abnormal tissue growth.

What do these scenarios have in common? They are real-world tasks that can be made more accurate and scalable by the Akida Vision Transformer, or ViT. Other areas that benefit from ViTs include agriculture, manufacturing, and smart devices. These mission-critical settings demand image analysis that is not only accurate but also performed in real time, often on cost-effective portable devices.

A new video on our YouTube channel explains how ViTs work and how Akida supports them.

A ViT is a specialized version of a Transformer Neural Network. It can analyze an image by superimposing a grid of equal-sized “patches” which are then flattened and embedded linearly. After position embeddings are added, the resulting sequence of vectors is fed to the Transformer encoder.
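
To make the patch-and-embed step concrete, here is a minimal NumPy sketch of the pipeline just described. The image size, patch size, and embedding width are illustrative assumptions chosen for the example, not Akida parameters, and the random projection stands in for weights a real model would learn.

```python
import numpy as np

# Minimal sketch of the ViT input pipeline described above.
# Image size (224x224x3), patch size (16), and embedding width (192)
# are illustrative choices for this example, not Akida-specific values.
IMG, PATCH, CHANNELS, DIM = 224, 16, 3, 192
NUM_PATCHES = (IMG // PATCH) ** 2                      # 14 x 14 = 196 patches

rng = np.random.default_rng(0)
image = rng.random((IMG, IMG, CHANNELS))               # stand-in for a camera frame

# 1. Superimpose a grid of equal-sized patches and flatten each one.
patches = image.reshape(IMG // PATCH, PATCH, IMG // PATCH, PATCH, CHANNELS)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(NUM_PATCHES, PATCH * PATCH * CHANNELS)

# 2. Linearly embed each flattened patch (a learned projection in a real model).
W_embed = rng.normal(scale=0.02, size=(PATCH * PATCH * CHANNELS, DIM))
tokens = patches @ W_embed                             # (196, 192)

# 3. Add position embeddings so the encoder knows where each patch came from.
pos_embed = rng.normal(scale=0.02, size=(NUM_PATCHES, DIM))
encoder_input = tokens + pos_embed                     # sequence fed to the Transformer encoder

print(encoder_input.shape)                             # (196, 192)
```

The Transformer encoder then processes this sequence of patch vectors much as a language model processes a sequence of word embeddings.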

The resource-hungry process of computing attention values has long kept Vision Transformers trapped in the Cloud, where memory and compute resources are abundant. However, the barriers to deploying Vision Transformers in offline settings have come down. Akida's 2nd generation platform optimizes the patch and position embeddings, the encoder block, and the Multi-layer Perceptron blocks in hardware, within a power envelope suited to battery-operated, portable devices. This accelerates the transition to Edge devices that deliver meaningful, accurate processing at high resolution and high frame rates.

But wait. Couldn't the same tasks already be accomplished with Convolutional Neural Networks (CNNs)? They can, and they have been. CNNs work well in traditional computing environments, but they have limitations. The pixel-level convolutions and pooling layers that CNNs use to extract relevant features are inherently local, while Vision Transformers can relate features that are far apart in an image. In fact, analysis shows that ViTs are more flexible than CNNs and capture global features more accurately from their intermediate representations.

A self-attention mechanism allows the model to attend to different parts of the image at different scales, capturing global and local features simultaneously. You may have heard of multi-head attention in the context of Natural Language Processing, where it is used to examine dependencies between words in text; this approach underpins popular AI applications, including ChatGPT. The same multi-head attention is applied in Akida's Vision Transformer encoders.
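
As a rough illustration of that mechanism, the sketch below implements scaled dot-product multi-head self-attention over a sequence of patch tokens in plain NumPy. The head count and random weight matrices are illustrative stand-ins for learned parameters; this shows the mechanism itself, not Akida's hardware implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, num_heads, rng):
    """Scaled dot-product self-attention over a sequence of patch tokens.

    Weight matrices are random stand-ins for learned parameters; this is a
    sketch of the mechanism, not Akida's hardware implementation.
    """
    n, dim = tokens.shape
    head_dim = dim // num_heads
    W_q, W_k, W_v, W_o = (rng.normal(scale=0.02, size=(dim, dim)) for _ in range(4))

    # Project tokens to queries, keys, values and split them into heads.
    def split(x):  # (n, dim) -> (num_heads, n, head_dim)
        return x.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(tokens @ W_q), split(tokens @ W_k), split(tokens @ W_v)

    # Every patch attends to every other patch, so dependencies can span the whole image.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (heads, n, n)
    attn = softmax(scores, axis=-1)
    context = attn @ v                                       # (heads, n, head_dim)

    # Merge the heads and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(n, dim)
    return merged @ W_o

rng = np.random.default_rng(0)
patch_tokens = rng.random((196, 192))      # e.g. the embedded patches from the earlier sketch
out = multi_head_self_attention(patch_tokens, num_heads=4, rng=rng)
print(out.shape)                            # (196, 192)
```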

Similarly, ViTs can learn features from spatially related regions, which enhances object detection, tracking, and segmentation. ViTs also generalize better when trained on distributed datasets.

Enabled by Akida, ViTs can be deployed remotely at very efficient power consumption to serve multiple industries. Over the next few weeks we will be starting a learning blog series that provides more detail. Learn more about ViTs on our website, or contact us to learn how we can help you get a clearer picture.