CAM3R: Camera-Agnostic Model for 3D Reconstruction

Namitha Guruprasad¹  Abhay Yadav¹  Cheng Peng²  Rama Chellappa¹

¹Johns Hopkins University, USA   ²University of Virginia, USA

CAM3R teaser showing two-view and multi-view 3D reconstruction
Fig. 1 — CAM3R provides robust, feed-forward 3D reconstruction in two-view and multi-view settings across disparate optical manifolds, including pinhole, fisheye, and panoramic cameras, where recent 3D foundation models fail. Above, we highlight CAM3R's performance on unseen scenes from the 360Loc dataset, visualizing both raw two-view predictions and multi-view reconstructions.

Abstract

Recovering dense 3D geometry from unposed images remains a foundational challenge in computer vision. Current state-of-the-art models are predominantly trained on perspective datasets, which implicitly constrains them to a standard pinhole camera geometry. As a result, these models suffer significant geometric degradation when applied to wide-angle imagery captured through non-rectilinear optics, such as fisheye or panoramic sensors. To address this, we present CAM3R, a Camera-Agnostic, feed-forward Model for 3D Reconstruction capable of processing images from wide-angle camera models without prior calibration. Our framework consists of a two-view network bifurcated into a Ray Module (RM), which estimates per-pixel ray directions, and a Cross-view Module (CVM), which infers radial distances along with confidence maps, pointmaps, and relative poses. To unify these pairwise predictions into a consistent 3D scene, we introduce a Ray-Aware Global Alignment framework for pose refinement and scale optimization that strictly preserves the predicted local geometry. Extensive experiments on datasets spanning diverse camera models, including panoramic, fisheye, and pinhole imagery, demonstrate that CAM3R establishes a new state of the art in pose estimation and reconstruction.

Overview

CAM3R pipeline showing shared ray module and cross-view module
Fig. 2 — CAM3R Overview. Given an input image pair (I_1, I_2), the framework operates through two parallel streams. The Shared Ray Module recovers the internal camera geometry by regressing Spherical Harmonic coefficients to reconstruct continuous ray directional fields d_i. Simultaneously, the Cross-view Module extracts features and utilizes a dual-block transformer decoder to facilitate information exchange between the two views. Specialized DPT heads then regress radial distances r_i with confidence maps σ_i, while a Relative Pose Network estimates the rigid transformation P_{2→1}. The local pointmaps X_{i,i} are generated by fusing rays d_i with radial distances r_i. Finally, the second view is transformed into the reference coordinate frame of the first view via P_{2→1} to produce the globally aligned 3D reconstruction.
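The fusion step above reduces to a simple composition: each local pointmap is the per-pixel ray direction scaled by its predicted radial distance, and view 2's points are mapped into view 1's frame with the predicted relative pose. Below is a minimal numpy sketch of that composition; the function and variable names (fuse_pointmap, to_reference_frame, R_21, t_21) are illustrative stand-ins, not the paper's API.

```python
import numpy as np

def fuse_pointmap(rays: np.ndarray, radial: np.ndarray) -> np.ndarray:
    """Local pointmap X_{i,i}: per-pixel unit ray direction scaled by radial distance.

    rays:   (H, W, 3) unit ray directions d_i from the Ray Module
    radial: (H, W)    radial distances r_i from the DPT head
    """
    return rays * radial[..., None]

def to_reference_frame(X: np.ndarray, R_21: np.ndarray, t_21: np.ndarray) -> np.ndarray:
    """Express view 2's pointmap in view 1's frame using P_{2->1} = (R_21, t_21)."""
    return X @ R_21.T + t_21

# Toy example with random predictions (illustrative only).
H, W = 4, 6
d2 = np.random.randn(H, W, 3)
d2 /= np.linalg.norm(d2, axis=-1, keepdims=True)   # unit ray directions
r2 = np.abs(np.random.randn(H, W)) + 1.0           # positive radial distances
R_21, t_21 = np.eye(3), np.array([0.1, 0.0, 0.0])  # placeholder relative pose

X_22 = fuse_pointmap(d2, r2)                  # view 2 points in view 2's frame
X_21 = to_reference_frame(X_22, R_21, t_21)   # view 2 points in view 1's frame
```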

Evaluation

Two-view Reconstruction

Model    | 2D3DS        | MegaDepth    | CO3Dv2       | 360Loc       | ADT
         | RRA    RTA   | RRA    RTA   | RRA    RTA   | RRA    RTA   | RRA    RTA
DUSt3R   | 10.6    6.0  | 95.6   80.8  | 94.7   43.1  |  0.0    0.0  | 91.0   63.6
MASt3R   | 18.3    9.3  | 69.7   56.4  | 98.4   33.4  | 39.8    5.3  | 96.6   63.5
Pow3R    |  7.5    6.0  | 96.2   74.2  | 95.8   38.3  |  0.0    0.0  | 96.6   79.2
VGGT     | 11.8   11.0  | 98.0   88.2  | 90.9   29.4  | 37.8   11.1  | 92.7   82.9
π³       | 16.8   11.4  | 99.8   93.3  | 90.7   22.7  | 38.5   13.0  | 97.5   93.8
CAM3R    | 97.7   94.3  | 96.8   94.2  | 97.5   88.2  | 96.0   91.0  | 99.0   95.0
Table 1 — Two-view results. Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) at a 15° threshold.
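For context, RRA and RTA report the fraction of image pairs whose relative rotation error and relative translation-direction error fall below the angular threshold (15° in Table 1). A minimal sketch of how such errors and accuracies are commonly computed from predicted and ground-truth relative poses is given below; the helper names are illustrative and not tied to CAM3R's evaluation code.

```python
import numpy as np

def rotation_angle_deg(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def translation_angle_deg(t_pred: np.ndarray, t_gt: np.ndarray) -> float:
    """Angle between translation directions, in degrees (scale is unobservable)."""
    cos = np.dot(t_pred, t_gt) / (np.linalg.norm(t_pred) * np.linalg.norm(t_gt) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def accuracy_at(errors_deg, threshold_deg: float = 15.0) -> float:
    """Fraction of image pairs whose error is below the threshold (RRA or RTA)."""
    return float((np.asarray(errors_deg) < threshold_deg).mean())
```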
Qualitative two-view 3D reconstructions
Fig. 3 — Qualitative Two-View Reconstructions. Visualization of 3D point clouds for image pairs across diverse optical manifolds (panorama, fisheye, pinhole). Despite extreme radial distortions and camera geometries, relative poses are accurately recovered and structural consistency is maintained. Note this is the raw output of the network.

Multi-view Reconstruction

Qualitative multi-view 3D reconstructions
Fig. 4 — Qualitative Multi-View Reconstructions. Global camera trajectories and dense point clouds recovered from unstructured image pools across diverse datasets. Despite high radial distortion and the lack of scene-graph information, globally consistent poses and structural geometry are maintained through the Ray-Aware Global Alignment, effectively mitigating trajectory drift and scale ambiguity.
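The Ray-Aware Global Alignment itself is defined in the paper; purely as an assumed illustration of the kind of objective such an alignment optimizes (a DUSt3R-style, confidence-weighted discrepancy over global poses and per-pair scales, with each pair's predicted local geometry held fixed), one per-edge residual might look like the sketch below. All names (edge_residual, X_j_local, X_j_in_i, s_e) are hypothetical and not the authors' code.

```python
import numpy as np

def edge_residual(T_i, T_j, s_e, X_j_local, X_j_in_i, conf):
    """Assumed per-edge residual for scene-graph edge (i, j); illustrative only.

    T_i, T_j:   (4, 4) current global camera-to-world poses (optimized)
    s_e:        scalar per-pair scale (optimized)
    X_j_local:  (N, 3) view-j points in frame j, predicted by the two-view network (fixed)
    X_j_in_i:   (N, 3) the same points expressed in frame i via the predicted relative pose (fixed)
    conf:       (N,)   predicted confidences used as per-point weights
    """
    def to_world(T, X):  # rigid transform into the world frame
        return X @ T[:3, :3].T + T[:3, 3]

    # Place view j's points in the world two ways and penalize the disagreement.
    via_own_pose = to_world(T_j, s_e * X_j_local)
    via_pairwise = to_world(T_i, s_e * X_j_in_i)
    return conf[:, None] * (via_own_pose - via_pairwise)
```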
Model    | 2D3DS               | MegaDepth           | 360Loc
         | RRA    RTA    mAA   | RRA    RTA    mAA   | RRA    RTA    mAA
VGGT     | 31.8   34.4    7.6  | 100    97.4   68.8  | 47.9   50.8   19.5
π³       | 40.0   35.8    9.6  | 100    98.4   73.4  | 48.6   47.4   17.8
CAM3R    | 94.0   91.5   73.5  | 96.6   96.3   87.4  | 98.7   91.2   82.6
Table 2 — Multi-view results. Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA) at a 30° threshold, and mean Average Accuracy (mAA).
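mAA aggregates pose accuracy over a range of angular thresholds rather than a single cut-off. A minimal sketch, assuming the common definition of averaging joint rotation-and-translation accuracy over integer thresholds from 1° up to the maximum (30° here); names are illustrative.

```python
import numpy as np

def mean_average_accuracy(rot_err_deg, trans_err_deg, max_threshold: int = 30) -> float:
    """mAA: average, over thresholds 1..max_threshold degrees, of the fraction of
    pairs whose rotation AND translation errors are both below the threshold."""
    rot = np.asarray(rot_err_deg)
    tra = np.asarray(trans_err_deg)
    accs = [np.mean((rot < t) & (tra < t)) for t in range(1, max_threshold + 1)]
    return float(np.mean(accs))
```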

Additional Visualizations

Zero-shot multi-view reconstructions on Matterport, FIORD, and CO3Dv2
Fig. 5 — Zero-shot generalization to unseen datasets. CAM3R successfully recovers dense 3D point clouds on Matterport, FIORD, and CO3Dv2 despite no multi-view training on these domains. While the Ray Module leverages prior exposure to Matterport's optical manifold via UniK3D initialization, the Cross-View Module demonstrates true zero-shot structural generalization. Notably, in the FIORD dataset, CAM3R unwraps extreme 2D fisheye aberrations, strictly preserving rectilinear structures (e.g., straight walls and sharp corners) in 3D space. To simulate unconstrained real-world environments, these scenes contain a heterogeneous mix of panoramic, fisheye, and pinhole captures. Multiple viewing angles are provided to highlight the structural integrity of the reconstructions.
In-domain multi-view reconstructions
Fig. 6 — In-domain multi-view 3D reconstructions. Extended qualitative results from the 360Loc, ADT, and MegaDepth test splits demonstrate CAM3R's robustness. In 360Loc, the model reconstructs expansive concourse regions despite the prevalence of smooth, highly reflective, and textureless surfaces that typically break traditional matching heuristics. For ADT, which features enclosed room scenes with dense, short-baseline egocentric fisheye captures, our rigorous scene-graph pruning yields a coherent point cloud. Multiple viewing angles are provided to illustrate the dense coverage and structural consistency of the fused multi-modal outputs.
Two-view reconstructions with cross-modal correspondences
Fig. 7 — Raw two-view reconstructions and cross-modal correspondences. We visualize constituent image pairs from the multi-view scenes evaluated in Fig. 6. CAM3R demonstrates robustness in recovering two-view relative geometry not only for homogeneous pairs (panorama-panorama, fisheye-fisheye, pinhole-pinhole) but also, crucially, across highly heterogeneous projection models (e.g., panorama-pinhole and panorama-fisheye in 360Loc). We overlay the 3D-3D Mutual Nearest Neighbor (MNN) matches utilized during our graph pruning phase, color-coded by the network's predicted distance (brighter yellow indicates stronger matches and closer points in 3D). Note: These visualizations represent the raw, feed-forward output of the two-view network prior to Ray-Aware Global Alignment.
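The 3D-3D MNN matches overlaid above pair points across the two predicted pointmaps that select each other as closest neighbors. A brute-force sketch of such matching between two point sets is shown below for clarity; the function name and shapes are illustrative, and the actual pruning criteria are those described in the paper.

```python
import numpy as np

def mutual_nearest_neighbors(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Brute-force 3D-3D mutual nearest neighbors between point sets P (N, 3) and Q (M, 3).

    Returns index pairs (i, j) such that Q[j] is the closest point to P[i]
    and P[i] is the closest point to Q[j].
    """
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise distances
    nn_pq = d.argmin(axis=1)          # for each P[i], index of its nearest Q
    nn_qp = d.argmin(axis=0)          # for each Q[j], index of its nearest P
    i = np.arange(len(P))
    mutual = nn_qp[nn_pq] == i        # keep pairs that choose each other
    return np.stack([i[mutual], nn_pq[mutual]], axis=1)
```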

Acknowledgements

We thank Anand Bhattad for helpful discussions and valuable feedback. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 140D0423C0076. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.