Lesson 5 of 82 min read

Curate & manage: datasets, slices, and lineage

Verified labels are only useful if you can find, slice, version, and trust them. Curation turns a pile of annotations into datasets you can reason about.

A model is a function of the data you train it on. So the question that decides model quality isn't only "are the labels correct?" — it's "which data, exactly, did this model learn from, and can I change that deliberately?" Curation is how you answer it.

The shape of your data

Avala organizes work around a small, consistent ontology:

Datasets are the raw material — images, video, LiDAR, MCAP sequences.
Projects bind a dataset to a task configuration (say, 3D cuboid annotation or lane polylines) and govern the lifecycle of work from pending through review and approval.
Tasks are the individual units of labeling, each with a status you can track.
Slices are saved, queryable subsets — your interface for "just the night scenes with pedestrians," or "every frame the deployed model was uncertain about."

This structure is what lets you treat data as something you compose, not something you dump.

Query, don't dig

Curation-before-training is the efficiency thesis of modern Physical AI: you get more from labeling the right 10% than from labeling everything. That only works if your data is queryable. Avala lets you filter and slice by scenario, attribute, confidence, and label content — so a dataset mix becomes a query you can express and re-run, not a folder you assemble by hand.

Practically, that means you can build a training set like "all left-turn scenarios with cyclists, plus every disengagement from last week" and keep it current as new data arrives.

Versioning and lineage

Verified data is an asset, and assets need provenance. Every label in Avala carries lineage — who produced it, which policy governed it, and how it was reviewed. Datasets are versioned, so a training run points at a specific, reproducible state of the data. When a model regresses, you can diff the data that changed, not just the code.

This is what "glass-box" means in practice. The opposite — opaque labels from an anonymous pipeline — is exactly what makes safety-critical debugging impossible.

In code

Curation is fully programmatic. List and inspect what you have:

from avala import Client

client = Client()
projects = client.projects.list()
for p in projects:
    print(p.name, p.status)

Define a slice, then hand it straight to the next stage. The output of curation is a clean, versioned, traceable dataset — the input training has been waiting for.

Next: Train: from verified dataset to model →

PreviousAnnotate NextTrain