TurboVGGT: Fast Visual Geometry Reconstruction
with Adaptive Alternating Attention

David Huang1,2,*,† Guile Wu1,† Chengjie Huang1 Bingbing Liu3 Dongfeng Bai1
1Huawei Noah's Ark Lab    2University of Toronto    3Foundation Model Department, Huawei
*David Huang contributed to this work during an internship at Huawei Canada. †Equal contribution.

Figure 1. TurboVGGT achieves fast multi-view 3D reconstruction while maintaining competitive reconstruction quality.

Abstract

Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced beyond the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to balance reconstruction quality against computational cost, which limits their scalability. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens for capturing global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT is an end-to-end trainable framework that pairs adaptive sparse global attention, guided by adaptive sparsity selection, to capture global relationships across frames, with frame attention, which aggregates local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens at varying sparsity levels for global geometry modeling, motivated by three observations: token importance varies across frames, attention layers operate on tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining reconstruction quality competitive with state-of-the-art methods.

Method

TurboVGGT Architecture

Figure 2. The overall framework of TurboVGGT.
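
To make the idea of alternating frame attention with sparse global attention concrete, the following is a minimal NumPy sketch, not the TurboVGGT implementation. Several parts are assumptions made purely for illustration: the token-norm importance score stands in for the learned adaptive sparsity selection, the per-frame `keep_ratios` are supplied by hand rather than learned, the frame-then-global ordering is assumed, and the attention is single-head with no learned projections, normalization, or MLP. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    # q: (..., Nq, D); k, v: (..., Nk, D). No learned projections (assumption).
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def frame_attention(x):
    # Each token attends only to tokens of its own frame. x: (F, N, D).
    return x + attention(x, x, x)

def sparse_global_attention(x, keep_ratios):
    # Every token queries a reduced key/value set gathered from all frames.
    # keep_ratios: (F,) fraction of tokens kept per frame (hand-set here).
    F, N, D = x.shape
    # Placeholder importance score: token L2 norm (assumption, not the
    # paper's learned selection).
    importance = np.linalg.norm(x, axis=-1)            # (F, N)
    kept = []
    for f in range(F):
        k = max(1, int(round(keep_ratios[f] * N)))     # keep at least one token
        top = np.argsort(importance[f])[-k:]           # indices of the k largest
        kept.append(x[f, top])
    kv = np.concatenate(kept, axis=0)                  # (K, D) representative tokens
    q = x.reshape(F * N, D)
    out = attention(q, kv, kv).reshape(F, N, D)
    return x + out, kv.shape[0]

def alternating_block(x, keep_ratios):
    # One frame-attention step followed by one sparse global step.
    x = frame_attention(x)
    return sparse_global_attention(x, keep_ratios)

# Toy input: 4 frames, 16 tokens per frame, 8 channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 8))
ratios = np.array([0.5, 0.25, 0.25, 0.125])            # varies across frames
y, n_kept = alternating_block(x, ratios)
print(y.shape, n_kept)
```

With F frames of N tokens each, dense global attention scores (FN)^2 query-key pairs. In this sketch the count drops to FN × K, where K is the number of kept tokens: 64 × 18 pairs for the toy input instead of 64 × 64. Passing a different `ratios` array to each block would mimic sparsity that also varies across layers.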

Contributions

Results

Qualitative Results

Figure 3. Qualitative comparison of point cloud reconstruction, camera pose estimation, and depth estimation.

Supplementary Point Cloud Visualization

Figure 4. Qualitative results for point cloud reconstruction.

Supplementary Camera Pose Visualization

Figure 5. Qualitative results for camera pose estimation.

Supplementary Depth Estimation Visualization

Figure 6. Qualitative results for depth estimation.

BibTeX

If you find this work useful, please cite our paper:

@article{huang2026turbovggt,
  title   = {TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention},
  author  = {Huang, David and Wu, Guile and Huang, Chengjie and Liu, Bingbing and Bai, Dongfeng},
  journal = {arXiv preprint arXiv:2605.14315},
  year    = {2026},
}