TurboVGGT: Fast Visual Geometry Reconstruction
with Adaptive Alternating Attention

David Huang¹,²,*,† Guile Wu¹,† Chengjie Huang¹ Bingbing Liu³ Dongfeng Bai¹

¹Huawei Noah's Ark Lab ²University of Toronto ³Foundation Model Department, Huawei

*David Huang contributed to this work during an internship at Huawei Canada. †Equal contribution.

Figure 1. TurboVGGT achieves fast multi-view 3D reconstruction while maintaining competitive reconstruction quality.

Abstract

Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced beyond the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to balance reconstruction quality against computational efficiency, which limits their scalability. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens for capturing global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT is an end-to-end trainable framework that alternates between two attention types: adaptive sparse global attention, guided by adaptive sparsity selection, captures global relationships across frames, and frame attention aggregates local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, motivated by three observations: token importance varies across frames, attention layers operate on tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods.

Method

Figure 2. The overall framework of TurboVGGT.
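
To make the alternating scheme from the abstract concrete, here is a minimal NumPy sketch of one block that applies frame attention (each frame attends to its own tokens) followed by dense global attention (all tokens across all frames attend to each other). This is an illustration under simplifying assumptions, not the TurboVGGT implementation: the function names are hypothetical, there are no learned projections, normalization layers, or MLPs, and the global step is dense rather than sparse.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, batched over leading axes.
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ np.swapaxes(k, -1, -2)) * scale, axis=-1)
    return weights @ v

def frame_attention(tokens):
    # tokens: (F, N, D). The frame axis acts as a batch axis,
    # so tokens never mix across frames.
    return attention(tokens, tokens, tokens)

def global_attention(tokens):
    # Flatten frames so every token can attend to every other token.
    F, N, D = tokens.shape
    flat = tokens.reshape(1, F * N, D)
    return attention(flat, flat, flat).reshape(F, N, D)

def alternating_block(tokens):
    # Hypothetical block: residual frame attention, then residual
    # global attention (identity projections for simplicity).
    tokens = tokens + frame_attention(tokens)
    tokens = tokens + global_attention(tokens)
    return tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16, 8))  # 4 frames, 16 tokens, dim 8
out = alternating_block(tokens)
print(out.shape)
```

With F frames of N tokens each, the dense global step builds an (F·N) × (F·N) attention matrix, while the frame step builds F matrices of size N × N. The global step is therefore the natural target for sparsification as the number of frames grows.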

Contributions

We present a novel visual geometry transformer with adaptive alternating attention blocks for fast multi-view 3D reconstruction.
We propose an adaptive sparsity selection mechanism for visual geometry transformers, which adaptively selects different sparsity ratios for different frames across different layers.
We propose adaptive sparse global attention for visual geometry transformers, which learns representative tokens to model global geometry relationships.
We conduct extensive experiments on multiple 3D reconstruction benchmarks and demonstrate the superiority of the proposed approach over state-of-the-art methods.
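
The second and third contributions can be illustrated with a small NumPy sketch in which each frame keeps a different fraction of its tokens, and all queries attend only to the kept tokens. Everything specific here is an assumption made for illustration: `select_tokens` and `sparse_global_attention` are hypothetical names, the token-norm score is a placeholder for whatever learned importance the model uses, and the keep ratios are fixed by hand, whereas TurboVGGT selects them adaptively per frame and per layer.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_tokens(tokens, scores, keep_ratios):
    # tokens: (F, N, D), scores: (F, N), keep_ratios: (F,) in (0, 1].
    # Keep the top-scoring tokens of each frame using that frame's ratio.
    kept = []
    for frame, score, ratio in zip(tokens, scores, keep_ratios):
        k = max(1, int(round(ratio * frame.shape[0])))
        top = np.argsort(score)[-k:]
        kept.append(frame[top])
    return np.concatenate(kept, axis=0)  # (M, D), M = total kept tokens

def sparse_global_attention(tokens, scores, keep_ratios):
    # Every token is a query; only the kept tokens act as keys/values.
    F, N, D = tokens.shape
    queries = tokens.reshape(F * N, D)
    keys = select_tokens(tokens, scores, keep_ratios)
    weights = softmax(queries @ keys.T / np.sqrt(D), axis=-1)
    return (weights @ keys).reshape(F, N, D), keys.shape[0]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 10, 8))   # 3 frames, 10 tokens, dim 8
scores = np.linalg.norm(tokens, axis=-1)   # placeholder importance score
keep_ratios = np.array([0.5, 0.2, 1.0])    # placeholder per-frame ratios
out, num_kept = sparse_global_attention(tokens, scores, keep_ratios)
print(out.shape, num_kept)
```

In this toy setting the attention matrix shrinks from 30 × 30 to 30 × 17, since 5, 2, and 10 tokens are kept from the three frames. Varying the ratios across layers would amount to passing a different `keep_ratios` array at each layer.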

Results

Figure 3. Qualitative comparison of point cloud reconstruction, camera pose estimation, and depth estimation.

Figure 4. Qualitative results for point cloud reconstruction.

Figure 5. Qualitative results for camera pose estimation.

Supplementary Depth Estimation Visualization

Figure 6. Qualitative results for depth estimation.

BibTeX

If you find this work useful, please cite our paper:

@article{huang2026turbovggt,
  title   = {TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention},
  author  = {Huang, David and Wu, Guile and Huang, Chengjie and Liu, Bingbing and Bai, Dongfeng},
  journal = {arXiv preprint arXiv:2605.14315},
  year    = {2026},
}