This work does not introduce a new method.
Instead, we present an interesting finding that questions the necessity of locality, a central inductive bias, in modern computer vision architectures.
Specifically, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results.
This is quite different from the popular design of Vision Transformers, which retains the inductive bias from ConvNets toward local neighborhoods (for example, by treating each 16x16 patch as a token).
We mainly demonstrate the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models.
Although operating directly on individual pixels is less computationally practical, we believe the community must be aware of this surprising finding when devising the next generation of neural architectures for computer vision.
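To make the tokenization difference concrete, the following is a minimal sketch (hypothetical, not the paper's actual code) contrasting the two schemes on a small image: treating every pixel as a token versus flattening each 16x16 patch into a token, ViT-style. The array names and sizes are illustrative assumptions.

```python
import numpy as np

# Illustrative example: pixel tokens vs. 16x16 patch tokens.
H, W, C, P = 32, 32, 3, 16  # image height/width/channels, patch size
image = np.random.rand(H, W, C)

# Pixels as tokens: each pixel becomes one token of dimension C.
pixel_tokens = image.reshape(H * W, C)  # shape (1024, 3) -- 1024 tokens

# Patches as tokens (ViT-style): each P x P block is flattened
# into a single token of dimension P*P*C.
patches = image.reshape(H // P, P, W // P, P, C)
patch_tokens = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
# shape (4, 768) -- only 4 tokens, but each carries a local neighborhood
```

Note the trade-off this makes visible: the pixel scheme yields a sequence P*P times longer (here 1024 vs. 4 tokens), which is why operating on raw pixels is computationally costly, while the patch scheme bakes locality into the token itself.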