CAM3R: Camera-Agnostic Model for 3D Reconstruction

Namitha Guruprasad¹  Abhay Yadav¹  Cheng Peng²  Rama Chellappa¹

¹Johns Hopkins University, USA   ²University of Virginia, USA

CAM3R teaser showing two-view and multi-view 3D reconstruction
Fig. 1 — CAM3R provides robust, feed-forward 3D reconstruction in two-view and multi-view settings across disparate optical manifolds, including pinhole, fisheye, and panoramic cameras, where recent 3D foundation models fail. Above, we highlight CAM3R's performance on unseen scenes from the 360Loc dataset, visualizing both raw two-view predictions and multi-view reconstructions.

Abstract

Recovering dense 3D geometry from unposed images remains a foundational challenge in computer vision. Current state-of-the-art models are predominantly trained on perspective datasets, which implicitly constrains them to a standard pinhole camera geometry. As a result, these models suffer significant geometric degradation when applied to wide-angle imagery captured through non-rectilinear optics, such as fisheye or panoramic sensors. To address this, we present CAM3R, a Camera-Agnostic, feed-forward Model for 3D Reconstruction capable of processing images from wide-angle camera models without prior calibration. Our framework consists of a two-view network bifurcated into a Ray Module (RM), which estimates per-pixel ray directions, and a Cross-view Module (CVM), which infers radial distances along with confidence maps, pointmaps, and relative poses. To unify these pairwise predictions into a consistent 3D scene, we introduce a Ray-Aware Global Alignment framework for pose refinement and scale optimization that strictly preserves the predicted local geometry. Extensive experiments on datasets spanning diverse camera models, including panoramic, fisheye, and pinhole imagery, demonstrate that CAM3R establishes a new state of the art in pose estimation and reconstruction.

Overview

CAM3R pipeline showing shared ray module and cross-view module
Fig. 2 — CAM3R Overview. Given an input image pair (I_1, I_2), the framework operates through two parallel streams. The Shared Ray Module recovers the internal camera geometry by regressing Spherical Harmonic coefficients to reconstruct continuous ray directional fields d_i. Simultaneously, the Cross-view Module extracts features and utilizes a dual-block transformer decoder to facilitate information exchange between the two views. Specialized DPT heads then regress radial distances r_i with confidence maps σ_i, while a Relative Pose Network estimates the rigid transformation P_{2→1}. The local pointmaps X_{i,i} are generated by fusing rays d_i with radial distances r_i. Finally, the second view is transformed into the reference coordinate frame of the first view via P_{2→1} to produce the globally aligned 3D reconstruction.
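The fusion step above reduces to a simple composition: each local pointmap is the per-pixel ray direction scaled by its predicted radial distance, and view 2's points are mapped into view 1's frame with the predicted relative pose. Below is a minimal numpy sketch of that composition; the function and variable names (fuse_pointmap, to_reference_frame, R_21, t_21) are illustrative stand-ins, not the paper's API.

```python
import numpy as np

def fuse_pointmap(rays: np.ndarray, radial: np.ndarray) -> np.ndarray:
    """Local pointmap X_{i,i}: per-pixel unit ray direction scaled by radial distance.

    rays:   (H, W, 3) unit ray directions d_i from the Ray Module
    radial: (H, W)    radial distances r_i from the DPT head
    """
    return rays * radial[..., None]

def to_reference_frame(X: np.ndarray, R_21: np.ndarray, t_21: np.ndarray) -> np.ndarray:
    """Express view 2's pointmap in view 1's frame using P_{2->1} = (R_21, t_21)."""
    return X @ R_21.T + t_21

# Toy example with random predictions (illustrative only).
H, W = 4, 6
d2 = np.random.randn(H, W, 3)
d2 /= np.linalg.norm(d2, axis=-1, keepdims=True)   # unit ray directions
r2 = np.abs(np.random.randn(H, W)) + 1.0           # positive radial distances
R_21, t_21 = np.eye(3), np.array([0.1, 0.0, 0.0])  # placeholder relative pose

X_22 = fuse_pointmap(d2, r2)                  # view 2 points in view 2's frame
X_21 = to_reference_frame(X_22, R_21, t_21)   # view 2 points in view 1's frame
```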

Evaluation

Two-view Reconstruction

Model    | 2D3DS        | MegaDepth    | CO3Dv2       | 360Loc       | ADT
         | RRA    RTA   | RRA    RTA   | RRA    RTA   | RRA    RTA   | RRA    RTA
DUSt3R   | 10.6    6.0  | 95.6   80.8  | 94.7   43.1  |  0.0    0.0  | 91.0   63.6
MASt3R   | 18.3    9.3  | 69.7   56.4  | 98.4   33.4  | 39.8    5.3  | 96.6   63.5
Pow3R    |  7.5    6.0  | 96.2   74.2  | 95.8   38.3  |  0.0    0.0  | 96.6   79.2
VGGT     | 11.8   11.0  | 98.0   88.2  | 90.9   29.4  | 37.8   11.1  | 92.7   82.9
π³       | 16.8   11.4  | 99.8   93.3  | 90.7   22.7  | 38.5   13.0  | 97.5   93.8
CAM3R    | 97.7   94.3  | 96.8   94.2  | 97.5   88.2  | 96.0   91.0  | 99.0   95.0
Table 1 — Two-view results. Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) at a 15° threshold.
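For context, RRA and RTA report the fraction of image pairs whose relative rotation error and relative translation-direction error fall below the angular threshold (15° in Table 1). A minimal sketch of how such errors and accuracies are commonly computed from predicted and ground-truth relative poses is given below; the helper names are illustrative and not tied to CAM3R's evaluation code.

```python
import numpy as np

def rotation_angle_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_angle_deg(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """Angle between translation directions, in degrees (scale is unobservable)."""
    cos = np.dot(t_pred, t_gt) / (np.linalg.norm(t_pred) * np.linalg.norm(t_gt) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def accuracy_at(errors_deg, threshold_deg: float = 15.0) -> float:
    """Fraction of image pairs whose error is below the threshold (RRA or RTA)."""
    return float((np.asarray(errors_deg) < threshold_deg).mean())
```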
Qualitative two-view 3D reconstructions
Fig. 3 — Qualitative Two-View Reconstructions. Visualization of 3D point clouds for image pairs across diverse optical manifolds (panorama, fisheye, pinhole). Despite extreme radial distortions and camera geometries, relative poses are accurately recovered and structural consistency is maintained. Note this is the raw output of the network.

Multi-view Reconstruction

Qualitative multi-view 3D reconstructions
Fig. 4 — Qualitative Multi-View Reconstructions. Global camera trajectories and dense point clouds recovered from unstructured image pools across diverse datasets. Despite high radial distortion and the lack of scene-graph information, globally consistent poses and structural geometry are maintained through the Ray-Aware Global Alignment, effectively mitigating trajectory drift and scale ambiguity.
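The Ray-Aware Global Alignment itself is defined in the paper; purely as an assumed illustration of the kind of objective such an alignment optimizes (a DUSt3R-style, confidence-weighted discrepancy over global poses and per-pair scales, with each pair's predicted local geometry held fixed), one per-edge residual might look like the sketch below. All names (edge_residual, X_j_local, X_j_in_i, s_e) are hypothetical and not the authors' code.

```python
import numpy as np

def edge_residual(T_i, T_j, s_e, X_j_local, X_j_in_i, conf):
    """Assumed per-edge residual for scene-graph edge (i, j); illustrative only.

    T_i, T_j:   (4, 4) current global camera-to-world poses (optimized)
    s_e:        scalar per-pair scale (optimized)
    X_j_local:  (N, 3) view-j points in frame j, predicted by the two-view network (fixed)
    X_j_in_i:   (N, 3) the same points expressed in frame i via the predicted relative pose (fixed)
    conf:       (N,)   predicted confidences used as per-point weights
    """
    def to_world(T, X):  # rigid transform into the world frame
        return X @ T[:3, :3].T + T[:3, 3]

    # Place view j's points in the world two ways and penalize the disagreement.
    via_own_pose = to_world(T_j, s_e * X_j_local)
    via_pairwise = to_world(T_i, s_e * X_j_in_i)
    return conf[:, None] * (via_own_pose - via_pairwise)
```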
Model    | 2D3DS               | MegaDepth           | 360Loc
         | RRA    RTA    mAA   | RRA    RTA    mAA   | RRA    RTA    mAA
VGGT     | 31.8   34.4    7.6  | 100    97.4   68.8  | 47.9   50.8   19.5
π³       | 40.0   35.8    9.6  | 100    98.4   73.4  | 48.6   47.4   17.8
CAM3R    | 94.0   91.5   73.5  | 96.6   96.3   87.4  | 98.7   91.2   82.6
Table 2 — Multi-view results. Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) at a 30° threshold, and mean Average Accuracy (mAA).
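mAA aggregates pose accuracy over a range of angular thresholds rather than a single cut-off. A minimal sketch, assuming the common definition of averaging joint rotation-and-translation accuracy over integer thresholds from 1° up to the maximum (30° here); names are illustrative.

```python
import numpy as np

def mean_average_accuracy(rot_err_deg, trans_err_deg, max_threshold: int = 30) -> float:
    """mAA: average, over thresholds 1..max_threshold degrees, of the fraction of
    pairs whose rotation AND translation errors are both below the threshold."""
    rot = np.asarray(rot_err_deg)
    tra = np.asarray(trans_err_deg)
    accs = [np.mean((rot < t) & (tra < t)) for t in range(1, max_threshold + 1)]
    return float(np.mean(accs))
```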

Additional Visualizations

Zero-shot multi-view reconstructions on Matterport, FIORD, and CO3Dv2
Fig. 5 — Zero-shot generalization to unseen datasets. CAM3R successfully recovers dense 3D point clouds on Matterport, FIORD, and CO3Dv2 despite no multi-view training on these domains. While the Ray Module leverages prior exposure to Matterport's optical manifold via UniK3D initialization, the Cross-View Module demonstrates true zero-shot structural generalization. Notably, in the FIORD dataset, CAM3R unwraps extreme 2D fisheye aberrations, strictly preserving rectilinear structures (e.g., straight walls and sharp corners) in 3D space. To simulate unconstrained real-world environments, these scenes contain a heterogeneous mix of panoramic, fisheye, and pinhole captures. Multiple viewing angles are provided to highlight the structural integrity of the reconstructions.
In-domain multi-view reconstructions
Fig. 6 — In-domain multi-view 3D reconstructions. Extended qualitative results from the 360Loc, ADT, and MegaDepth test splits demonstrate CAM3R's robustness. In 360Loc, the model reconstructs expansive concourse regions despite the prevalence of smooth, highly reflective, and textureless surfaces that typically break traditional matching heuristics. For ADT, which features enclosed room scenes with dense, short-baseline egocentric fisheye captures, our rigorous scene-graph pruning yields a coherent point cloud. Multiple viewing angles are provided to illustrate the dense coverage and structural consistency of the fused multi-modal outputs.
Two-view reconstructions with cross-modal correspondences
Fig. 7 — Raw two-view reconstructions and cross-modal correspondences. We visualize constituent image pairs from the multi-view scenes evaluated in Fig. 6. CAM3R demonstrates robustness in recovering two-view relative geometry not only for homogeneous pairs (panorama-panorama, fisheye-fisheye, pinhole-pinhole) but also, crucially, across highly heterogeneous projection models (e.g., panorama-pinhole and panorama-fisheye in 360Loc). We overlay the 3D-3D Mutual Nearest Neighbor (MNN) matches utilized during our graph pruning phase, color-coded by the network's predicted distance (brighter yellow indicates stronger matches and closer points in 3D). Note: These visualizations represent the raw, feed-forward output of the two-view network prior to Ray-Aware Global Alignment.
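The 3D-3D MNN matches overlaid above pair points across the two predicted pointmaps that select each other as closest neighbors. A brute-force sketch of such matching between two point sets is shown below for clarity; the function name and shapes are illustrative, and the actual pruning criteria are those described in the paper.

```python
import numpy as np

def mutual_nearest_neighbors(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Brute-force 3D-3D mutual nearest neighbors between point sets P (N, 3) and Q (M, 3).

    Returns index pairs (i, j) such that Q[j] is the closest point to P[i]
    and P[i] is the closest point to Q[j].
    """
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise distances
    nn_pq = d.argmin(axis=1)          # for each P[i], index of its nearest Q
    nn_qp = d.argmin(axis=0)          # for each Q[j], index of its nearest P
    i = np.arange(len(P))
    mutual = nn_qp[nn_pq] == i        # keep pairs that choose each other
    return np.stack([i[mutual], nn_pq[mutual]], axis=1)
```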

Acknowledgements

We thank Anand Bhattad for helpful discussions and valuable feedback. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 140D0423C0076. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.