ICRA 2026 Accepted

ScaleMaster Dataset

A Large-Scale Dataset and Benchmark for Evaluating Scale Consistency in Complex Indoor Environments

Hyoseok Ju1, Bokeon Suh1, Giseop Kim1†
1Daegu Gyeongbuk Institute of Science and Technology (DGIST), Daegu, Republic of Korea
📄 Paper 💻 GitHub 📥 Download Dataset

Abstract

Recent advances in deep monocular visual SLAM have achieved impressive accuracy and dense reconstruction capabilities, yet their robustness to scale inconsistency in large-scale indoor environments remains largely unexplored. Existing benchmarks are limited to room-scale or structurally simple settings, leaving critical issues of intra-session scale drift and inter-session scale ambiguity insufficiently addressed. To fill this gap, we introduce the ScaleMaster Dataset, the first benchmark explicitly designed to evaluate scale consistency under challenging scenarios such as multi-floor structures, long trajectories, repetitive views, and low-texture regions. We systematically analyze the vulnerability of state-of-the-art deep monocular visual SLAM systems to scale inconsistency, providing both qualitative and quantitative evaluations. Crucially, our analysis extends beyond traditional trajectory metrics to include a direct map-to-map quality assessment using metrics such as Chamfer distance against high-fidelity 3D ground truth. Our results reveal that while these systems demonstrate strong performance on existing benchmarks, they suffer from severe scale-related failures in realistic, large-scale indoor environments. By releasing the ScaleMaster dataset and baseline results, we aim to establish a foundation for future research toward developing scale-consistent and reliable visual SLAM systems.
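For readers unfamiliar with the map-to-map metric, the symmetric Chamfer distance averages nearest-neighbor distances in both directions between the reconstructed map and the ground-truth point cloud. The snippet below is a minimal illustrative sketch using SciPy's KD-tree; it is not the exact evaluation code used in the paper.

import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(recon, gt):
    """Symmetric Chamfer distance between two (N, 3) metric point clouds."""
    d_recon_to_gt, _ = cKDTree(gt).query(recon)    # nearest GT point per reconstructed point
    d_gt_to_recon, _ = cKDTree(recon).query(gt)    # nearest reconstructed point per GT point
    return d_recon_to_gt.mean() + d_gt_to_recon.mean()

# Placeholder clouds; replace with a SLAM map and the released ground-truth scan.
recon = np.random.rand(10_000, 3)
gt = np.random.rand(12_000, 3)
print(f"Chamfer distance: {chamfer_distance(recon, gt):.4f} m")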

Dataset Statistics

25 Sequences
3.8 km+ Total Length
10+ Environments
RGB+D+IMU Data Type

Baseline SE(3) Pose Refinement

Raw ARKit VIO poses accumulate drift over long trajectories. We correct this by retrieving and manually verifying loop closures, then optimizing the pose graph to produce drift-corrected ground-truth poses.

Pipeline Overview

  1. ARKit VIO odometry — Raw 6-DoF poses captured from Apple ARKit; drift accumulates over large-scale trajectories.
  2. HLoc retrieval — NetVLAD selects loop closure candidates; SuperPoint + LightGlue matches verify geometric consistency via essential-matrix estimation.
  3. Manual verification — Each candidate is accepted or rejected interactively, ensuring only reliable loop pairs are used.
  4. SE(3) relative pose estimation — Depth Anything V3 provides metric depth to compute the 6-DoF transform between accepted loop pairs.
  5. GTSAM PGO — Pose graph optimization jointly minimizes odometry and loop closure residuals, yielding drift-corrected trajectories (a minimal sketch follows this list).
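The sketch below illustrates how step 5 could look with GTSAM's Python bindings. The variable names (vio_poses, loop_closures) and noise sigmas are placeholders for illustration and do not reflect the exact configuration used to produce the released ground truth.

import numpy as np
import gtsam

# Illustrative inputs: vio_poses is a list of 4x4 world-from-camera matrices from ARKit,
# loop_closures is a list of (i, j, T_ij) with T_ij the relative SE(3) transform from step 4.
vio_poses = [np.eye(4) for _ in range(100)]
loop_closures = []  # e.g. [(10, 90, np.eye(4))]

graph = gtsam.NonlinearFactorGraph()
initial = gtsam.Values()

# Sigmas are ordered (rx, ry, rz, tx, ty, tz) for Pose3; the values here are placeholders.
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 1e-6))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.01, 0.01, 0.01, 0.05, 0.05, 0.05]))
loop_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.02, 0.02, 0.02, 0.10, 0.10, 0.10]))

# Anchor the first pose and add odometry factors between consecutive VIO poses.
graph.add(gtsam.PriorFactorPose3(0, gtsam.Pose3(vio_poses[0]), prior_noise))
for i in range(len(vio_poses) - 1):
    T_rel = np.linalg.inv(vio_poses[i]) @ vio_poses[i + 1]
    graph.add(gtsam.BetweenFactorPose3(i, i + 1, gtsam.Pose3(T_rel), odom_noise))

# Add verified loop closure constraints.
for i, j, T_ij in loop_closures:
    graph.add(gtsam.BetweenFactorPose3(i, j, gtsam.Pose3(T_ij), loop_noise))

# Initialize with the raw VIO poses and optimize.
for i, T in enumerate(vio_poses):
    initial.insert(i, gtsam.Pose3(T))
result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
refined = [result.atPose3(i).matrix() for i in range(len(vio_poses))]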

Before / After Pose Graph Optimization

Four example sequences, each shown before and after pose graph optimization.

Red = trajectory before PGO  |  Yellow = trajectory after PGO  |  Green = verified loop closures

Rerun Visualization


3D trajectory with camera frustums, loop closure edges, and RGB images in Rerun.
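A minimal logging sketch along these lines is shown below, assuming refined 4x4 poses and RGB frames as NumPy arrays; the entity paths and intrinsics are illustrative, and exact API names may vary across Rerun SDK versions.

import numpy as np
import rerun as rr

rr.init("scalemaster_viewer", spawn=True)  # opens the Rerun viewer

poses = [np.eye(4)]                                  # placeholder: refined world-from-camera poses
images = [np.zeros((480, 640, 3), dtype=np.uint8)]   # placeholder RGB frames

for i, (T, img) in enumerate(zip(poses, images)):
    rr.set_time_sequence("frame", i)
    # Camera pose as a 3D transform; the pinhole makes the frustum visible in the viewer.
    rr.log("world/camera", rr.Transform3D(translation=T[:3, 3], mat3x3=T[:3, :3]))
    rr.log("world/camera/image", rr.Pinhole(focal_length=500.0, width=640, height=480))
    rr.log("world/camera/image", rr.Image(img))

# Trajectory as a line strip; loop closure edges can be logged the same way.
traj = np.array([T[:3, 3] for T in poses])
rr.log("world/trajectory", rr.LineStrips3D([traj]))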

Demo

Dataset Sequences

Click on any sequence to explore in 3D

📚 Library Environments (9 sequences)

🏢 Large Hall Environments (5 sequences)

🚗 Parking & Basement (3 sequences)

🪜 Stairs & Station (3 sequences)

🏠 Indoor Rooms (5 sequences)

Download Dataset

📥 Download

Survey for Download
Individual sequences
Browse Dataset

💻 Git Clone

Clone entire dataset
with Git LFS
Show Command

📖 Documentation

Dataset format
Usage examples
View on GitHub

Citation

@inproceedings{ju2026scalemaster,
  title={Have We Mastered Scale in Deep Monocular Visual SLAM? The ScaleMaster Dataset and Benchmark},
  author={Ju, Hyoseok and Suh, Bokeon and Kim, Giseop},
  booktitle={Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year={2026},
  note={To appear}
}