Lesson 4 of 8 · 3 min read

Annotate: auto-label + human-verified ground truth

The stage that makes the loop a loop. Models trained on your data auto-label the bulk; human experts verify the hard cases through consensus. The output is deterministic 4D ground truth.

This is the chapter that separates a data engine from a data viewer. Everything so far — ingesting, visualizing — is about seeing your data. Annotation is about trusting it. It's where a recording becomes ground truth a model can safely learn from.

Two passes: machines first, experts on the hard part

Labeling every frame by hand doesn't scale, and labeling every frame by model isn't reliable enough for safety-critical AI. Avala does both, in the right order.

Auto-labeling first. Models — foundation models like SAM3, plus agents trained on your data — segment and track objects across frames, fit 3D cuboids, estimate depth and pose, and pre-annotate the bulk of a recording. On mature projects this handles the large majority of annotations automatically.

Human experts on the exceptions. Foundation models are not yet reliable on the genuinely hard cases — occlusions, rare objects, ambiguous geometry. Those route to a managed workforce of domain experts. They don't re-label everything; they resolve the cases the models flag as uncertain.

The result is the data flywheel in miniature: auto-label → expert-verify → retrain the auto-labeler on the corrections → repeat. The more you run it, the more the machines handle.
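
The routing step in that loop can be sketched in a few lines. This is a minimal illustration, not Avala's implementation: the `AutoLabel` record, its `confidence` field, and the 0.9 threshold are all assumptions made for the example. It only shows the idea of accepting confident auto-labels and sending the rest to expert review.

```python
from dataclasses import dataclass


@dataclass
class AutoLabel:
    # Hypothetical record for one auto-generated annotation.
    object_id: str
    label: str
    confidence: float  # assumed model score in [0, 1]


def route(labels, threshold=0.9):
    """Split auto-labels into accepted ones and ones that need expert review."""
    accepted, needs_review = [], []
    for item in labels:
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            needs_review.append(item)
    return accepted, needs_review


labels = [
    AutoLabel("car-1", "car", 0.98),
    AutoLabel("ped-7", "pedestrian", 0.55),  # e.g. a partially occluded object
    AutoLabel("cone-3", "traffic_cone", 0.91),
]

accepted, needs_review = route(labels)
print([item.object_id for item in accepted])      # ['car-1', 'cone-3']
print([item.object_id for item in needs_review])  # ['ped-7']
```

In a real flywheel, the expert corrections on the `needs_review` items would become the training data for the next auto-labeler, which is what shrinks that list over time.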

Why it's 4D, not 2D

Generalist platforms draw 2D boxes on flat frames. That breaks in an embodied context, where the same object must be consistent across every camera and every moment. Avala's annotation happens against the unified 4D reconstruction from the previous lesson, so:

a 3D cuboid is placed once and stays geometrically consistent across frames, with interpolation between keyframes and a stable tracking ID over time,
a segmentation mask maps coherently between the point cloud and every camera image, and
labels don't suffer the calibration drift that plagues home-grown pipelines stitched from separate tools.
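
To make the first point concrete, here is a rough sketch of keyframe interpolation for a tracked cuboid. It assumes plain linear interpolation of position and size plus shortest-path interpolation of yaw. The `Cuboid` type and its fields are hypothetical, and the platform's own interpolation may work differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Cuboid:
    # Hypothetical cuboid representation for illustration only.
    track_id: str
    center: tuple  # (x, y, z) in metres
    size: tuple    # (length, width, height) in metres
    yaw: float     # heading in radians


def lerp(a, b, t):
    """Linear interpolation between two scalars."""
    return a + (b - a) * t


def lerp_angle(a, b, t):
    """Interpolate angles along the shortest arc, so 179 deg -> -179 deg does not spin around."""
    diff = (b - a + math.pi) % (2 * math.pi) - math.pi
    return a + diff * t


def interpolate(k0, frame0, k1, frame1, frame):
    """Estimate the cuboid at `frame` from two keyframes of the same track."""
    if k0.track_id != k1.track_id:
        raise ValueError("keyframes must belong to the same track")
    if frame0 >= frame1 or not frame0 <= frame <= frame1:
        raise ValueError("frame must lie between two distinct, ordered keyframes")
    t = (frame - frame0) / (frame1 - frame0)
    return Cuboid(
        track_id=k0.track_id,  # the tracking ID stays stable across frames
        center=tuple(lerp(a, b, t) for a, b in zip(k0.center, k1.center)),
        size=tuple(lerp(a, b, t) for a, b in zip(k0.size, k1.size)),
        yaw=lerp_angle(k0.yaw, k1.yaw, t),
    )


start = Cuboid("car-1", (0.0, 0.0, 0.0), (4.5, 1.8, 1.5), 0.0)
end = Cuboid("car-1", (10.0, 2.0, 0.0), (4.5, 1.8, 1.5), math.pi / 2)
mid = interpolate(start, 0, end, 10, 5)
print(mid.center)  # (5.0, 1.0, 0.0)
```

An annotator only places the two keyframes; the frames in between are derived, which is why one cuboid can stay consistent over a long track.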

The toolset spans what Physical AI actually needs: 2D boxes and polygons, 3D cuboids with tracking and interpolation, 2D/3D segmentation, polylines and lanes, classification, and keypoints — plus an in-browser Gaussian-splat editor for working directly with reconstructed scenes.

Quality you can audit

Trust requires evidence. Avala routes the same data to multiple independent annotators and measures inter-annotator agreement — consensus — to catch ambiguous cases instead of silently shipping them. This is glass-box QA: every label is traceable to who made it, why, and under which policy. When a model fails later, you can follow the label back to its source.
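
As a simple illustration of consensus, the sketch below takes the labels several annotators gave to the same item, picks the majority label, and flags the item as ambiguous when agreement is low. The function name, the annotator IDs, and the 0.6 threshold are assumptions for the example. Production systems often use more robust agreement measures, such as Cohen's or Fleiss' kappa, or IoU for geometry.

```python
from collections import Counter


def consensus(votes, min_agreement=0.6):
    """Summarise the votes for one item.

    `votes` maps a (hypothetical) annotator ID to the label that annotator
    chose, which keeps each label traceable to its author.
    Returns (majority_label, agreement, is_ambiguous, dissenters).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes.values()).most_common(1)[0]
    agreement = count / len(votes)
    dissenters = sorted(who for who, choice in votes.items() if choice != label)
    return label, agreement, agreement < min_agreement, dissenters


clear = consensus({"ann-a": "car", "ann-b": "car", "ann-c": "truck"})
unclear = consensus({"ann-a": "car", "ann-b": "truck", "ann-c": "van"})

print(clear[0], clear[2], clear[3])  # car False ['ann-c']
print(unclear[2])                    # True
```

Items flagged as ambiguous would go back for another review pass instead of being shipped silently.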

In code

You can drive auto-labeling and exports programmatically. Kick off an auto-label job from the CLI:

avala autolabel create --project <uid> --model sam3

Or trigger it from the SDK and let webhooks notify you when consensus is reached. The point is that annotation isn't a manual side-quest bolted onto your pipeline — it's an API-driven stage inside the loop.

By the end of this stage you have what no viewer or storage layer can give you on its own: a recording turned into verified, four-dimensional ground truth, ready to curate and train on.

Next: Curate & manage: datasets, slices, and lineage →
