This work does not introduce a new method.
Instead, we present an interesting finding that questions the necessity of locality, a central inductive bias, in modern computer vision architectures.
Specifically, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results.
This is quite different from the popular design of Vision Transformers, which retains the inductive bias from ConvNets toward local neighborhoods (for example, by treating each 16x16 patch as a token).
We mainly demonstrate the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via masked autoencoding, and image generation with diffusion models.
Although operating directly on individual pixels is less computationally practical, we believe the community must be aware of this surprising finding when devising the next generation of neural architectures for computer vision.
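To make the tokenization difference concrete, the following is a minimal sketch (hypothetical, not the paper's actual code) contrasting the two schemes on a small image: treating every pixel as a token versus flattening each 16x16 patch into a token, ViT-style. The array names and sizes are illustrative assumptions.

```python
import numpy as np

# Illustrative example: pixel tokens vs. 16x16 patch tokens.
H, W, C, P = 32, 32, 3, 16  # image height/width/channels, patch size
image = np.random.rand(H, W, C)

# Pixels as tokens: each pixel becomes one token of dimension C.
pixel_tokens = image.reshape(H * W, C)  # shape (1024, 3) -- 1024 tokens

# Patches as tokens (ViT-style): each P x P block is flattened
# into a single token of dimension P*P*C.
patches = image.reshape(H // P, P, W // P, P, C)
patch_tokens = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
# shape (4, 768) -- only 4 tokens, but each carries a local neighborhood
```

Note the trade-off this makes visible: the pixel scheme yields a sequence P*P times longer (here 1024 vs. 4 tokens), which is why operating on raw pixels is computationally costly, while the patch scheme bakes locality into the token itself.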