Lesson 3 of 8 · 2 min read

Ingest & Visualize: from MCAP to a 4D scene

Upload a raw multi-sensor recording and watch it become a synchronized 4D scene in the browser — point clouds, multi-camera playback, and splats on one timeline.

The loop starts with raw data, and raw data is messy. A single recording can carry three camera feeds at 30 Hz, a LiDAR sensor at 10 Hz, an IMU at 200 Hz, and GPS at 1 Hz — each on its own clock, each in its own schema. Before you can do anything useful, those streams have to be ingested, time-aligned, and made explorable.
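To make "time-aligned" concrete, here is a minimal sketch of nearest-timestamp matching, one common way to line up streams that tick at different rates. It is an illustration under stated assumptions, not a description of how Avala aligns data internally: the function names, the tolerance value, and the synthetic timestamps are all hypothetical, and it assumes every stream has already been converted to a shared clock.

```python
from bisect import bisect_left

def nearest_index(timestamps, t):
    """Return the index of the timestamp closest to t (list must be sorted, non-empty)."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbor is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align(reference, streams, tolerance):
    """For each reference timestamp, pick the nearest sample from every other stream.

    Returns a list of dicts: {"t": ref_time, stream_name: index_or_None}.
    A stream maps to None when it is empty or its nearest sample is
    farther away than `tolerance`.
    """
    frames = []
    for t in reference:
        frame = {"t": t}
        for name, stamps in streams.items():
            if not stamps:
                frame[name] = None
                continue
            j = nearest_index(stamps, t)
            frame[name] = j if abs(stamps[j] - t) <= tolerance else None
        frames.append(frame)
    return frames

# Synthetic timestamps in integer milliseconds (hypothetical data).
lidar = [0, 100, 200]                      # 10 Hz, used as the reference stream
camera = [5 + 33 * k for k in range(8)]    # roughly 30 Hz, offset by 5 ms
gps = [0]                                  # 1 Hz: a single fix in this window

frames = align(lidar, {"camera": camera, "gps": gps}, tolerance=50)
for frame in frames:
    print(frame)
```

Using the slowest dense sensor (here the LiDAR) as the reference keeps one frame per sweep. The tolerance decides when a sparse stream such as GPS is treated as missing for a given frame rather than matched to a stale sample.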

Getting data in

Avala ingests the formats robotics teams actually produce. MCAP is the recommended container; ROS 1 .bag and ROS 2 .db3 are converted to MCAP automatically on upload. You can also bring plain images, video, and point-cloud files (PCD/PLY), or connect your own S3/GCS bucket for zero-copy ingestion of data you already store.

From the terminal:

avala datasets upload ./recordings/run_42.mcap --dataset <uid>

Or wire up a cloud connector and let new recordings flow in automatically. For large files, ingestion runs server-side — it indexes channels and messages so the recording becomes queryable, not merely stored.
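The difference between "stored" and "queryable" can be sketched with a toy index. The code below is only an illustration of the idea, assuming messages arrive as (topic, log time) pairs. The topic names and function names are hypothetical, and this is not Avala's actual index format.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_index(messages):
    """Group message log times by channel (topic), sorted for binary search.

    `messages` is an iterable of (topic, log_time) pairs in any order.
    """
    index = defaultdict(list)
    for topic, log_time in messages:
        index[topic].append(log_time)
    for stamps in index.values():
        stamps.sort()
    return dict(index)

def count_in_range(index, topic, start, end):
    """Count messages on `topic` with start <= log_time <= end."""
    stamps = index.get(topic, [])
    return bisect_right(stamps, end) - bisect_left(stamps, start)

# Hypothetical messages: (topic, log time in milliseconds).
messages = [
    ("/lidar", 0), ("/cam", 5), ("/cam", 38),
    ("/lidar", 100), ("/cam", 71),
]
index = build_index(messages)
print(sorted(index))                           # channels present in the recording
print(count_in_range(index, "/cam", 0, 50))    # camera messages in the first 50 ms
```

Once the per-channel timestamps are sorted, a time-range lookup is a pair of binary searches instead of a scan over the whole file, which is what makes scrubbing a large recording practical.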

A good recording makes everything downstream easier. Include your transforms (/tf and /tf_static), use compressed images, and embed your message schemas so the platform can interpret custom types without guesswork.

Seeing it in 4D

Once ingested, a recording opens as a synchronized scene in the browser — no desktop app, no downloads. The viewer is GPU-accelerated via WebGPU, and it does more than play back video side by side:

3D point clouds render natively with level-of-detail and frustum culling, so million-point LiDAR sweeps stay interactive. Color them by intensity, height, label, or by projecting the camera image onto the cloud.
Multi-camera playback stays frame-locked to the 3D scene, with calibrated projection between camera and LiDAR space — pinhole and fisheye/double-sphere models included.
Gaussian-splat reconstructions let you view a scene as a continuous photorealistic 3D field, not just discrete points.
A unified timeline scrubs every modality at once, so you can step to the exact moment something went wrong and see whether all sensors agree.
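The calibrated projection between LiDAR and camera space mentioned above can be sketched for the simplest case, an ideal pinhole camera with no lens distortion. This is a rough illustration, not the viewer's implementation: the function name, the extrinsics, and the intrinsics below are made-up values, and fisheye or double-sphere cameras need a different projection function.

```python
def project_pinhole(point_lidar, rotation, translation, fx, fy, cx, cy):
    """Project one LiDAR-frame point into pixel coordinates.

    rotation (3x3 nested lists) and translation (length 3) map LiDAR
    coordinates into the camera frame: p_cam = R * p_lidar + t.
    Returns (u, v), or None if the point is behind the camera.
    """
    x, y, z = (
        sum(r * p for r, p in zip(row, point_lidar)) + t
        for row, t in zip(rotation, translation)
    )
    if z <= 0:
        return None
    # Perspective divide, then scale by focal length and shift to the principal point.
    return (fx * x / z + cx, fy * y / z + cy)

# Hypothetical extrinsics: LiDAR axes (x forward, y left, z up) mapped to
# camera axes (x right, y down, z forward), with both sensors at the same origin.
R = [[0, -1, 0],
     [0, 0, -1],
     [1, 0, 0]]
t = [0, 0, 0]
# Hypothetical intrinsics for a 1920x1080 image.
fx, fy, cx, cy = 1000, 1000, 960, 540

print(project_pinhole((10, 0, 0), R, t, fx, fy, cx, cy))   # straight ahead
print(project_pinhole((10, -1, 0), R, t, fx, fy, cx, cy))  # 1 m to the right
print(project_pinhole((-5, 0, 0), R, t, fx, fy, cx, cy))   # behind the camera
```

The same function applied to the eight corners of a cuboid gives the outline you would expect to see drawn over the camera image. If the projected outline drifts off the object, the extrinsics or intrinsics are the first thing to check.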

The point of the 4D view isn't just debugging. It's that a single spatial context — where camera, LiDAR, and transforms are reconciled — is the foundation that makes the next stage, annotation, deterministic. When you draw a label here, it's anchored in a scene that's already internally consistent.

What "good" looks like

By the end of ingest, you should be able to:

open a recording and scrub all sensors on one timeline,
confirm your calibration by checking that cuboids and points project cleanly onto the camera images, and
spot the frames that matter — the edge cases worth labeling.

That last point is the bridge to the rest of the loop. You're no longer staring at a 4 GB folder; you're looking at a scene, and you've found the moments that need ground truth.

Next: Annotate: auto-label + human-verified ground truth →
