This work introduces Depth Anything V2.
Rather than pursuing fancy techniques, we aim to reveal key findings that pave the way for building powerful monocular depth estimation models.
Notably, compared to V1, this version produces finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via large-scale pseudo-labeled real images.
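For illustration only, below is a minimal, self-contained sketch of this teacher-student recipe, not the released training code: a high-capacity teacher is trained on precisely labeled synthetic images, the teacher pseudo-labels a pool of unlabeled real images, and a smaller student is trained only on those pseudo-labels. The `TinyDepthNet` module, the L1 loss, and the random data are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Illustrative stand-in for a real depth network (hypothetical)."""
    def __init__(self, width=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def train(model, images, targets, steps=100, lr=1e-4):
    """Tiny full-batch training loop; L1 stands in for the actual depth loss."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(images), targets)
        loss.backward()
        opt.step()
    return model

# Stage 1: train a high-capacity teacher on synthetic images with exact labels.
synthetic_images = torch.rand(8, 3, 64, 64)
synthetic_depth = torch.rand(8, 1, 64, 64)
teacher = train(TinyDepthNet(width=64), synthetic_images, synthetic_depth)

# Stage 2: use the frozen teacher to pseudo-label unlabeled real images.
real_images = torch.rand(32, 3, 64, 64)
with torch.no_grad():
    pseudo_depth = teacher(real_images)

# Stage 3: train a smaller student only on the pseudo-labeled real images.
student = train(TinyDepthNet(width=16), real_images, pseudo_depth)
```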
Compared to the latest models built on Stable Diffusion, our model is more efficient (over 10 times faster) and more accurate. We offer models of different sizes (ranging from 25M to 1.3B parameters) to support a wide range of scenarios.
Thanks to their strong generalization capabilities, we fine-tune them with metric depth labels to obtain our metric depth models.
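As a rough illustration of this metric fine-tuning step, the following self-contained sketch fine-tunes a stand-in pretrained network on metric depth labels; the scale-invariant log loss, learning rate, and data here are assumptions for the example, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained relative-depth backbone (illustrative only;
# in practice the weights would come from the pretrained checkpoint).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),  # positive depth in metres
)

def silog_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss, a common metric-depth objective (assumed here)."""
    d = torch.log(pred + eps) - torch.log(target + eps)
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2)

opt = torch.optim.AdamW(model.parameters(), lr=5e-6)
images = torch.rand(4, 3, 64, 64)                     # stand-in RGB batch
metric_depth = 0.5 + 9.5 * torch.rand(4, 1, 64, 64)   # stand-in labels, metres

for _ in range(20):
    opt.zero_grad()
    loss = silog_loss(model(images), metric_depth)
    loss.backward()
    opt.step()
```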
In addition to our models, considering the limited diversity and frequent noise in the current test sets, we have constructed a versatile evaluation benchmark with precise annotations and diverse scenes to facilitate future research.