Lesson 6 of 8 · 2 min read

Train: from verified dataset to model

A verified, versioned dataset flows into your training pipeline. You train on ground truth, not raw sensor dumps — and the format meets your framework where it lives.

Everything in the loop so far exists to make this moment boring. Training should be the easy part: point your pipeline at a dataset you trust and run. The reason it usually isn't easy is that the data was never really ready — labels were inconsistent, provenance was lost, and the export was a fragile script nobody wanted to touch.

By this stage, that work is done. You have a versioned, consensus-verified, traceable dataset. Now you move it into training.

Getting data into your pipeline

Avala produces datasets in the formats training frameworks expect. Create an export from a project or a curated slice:

avala exports create --project <uid> --format coco
avala exports wait <export-uid>

Supported export formats include Avala's native JSON, COCO, KITTI (2D and 3D), Pascal VOC, and YOLO — so the handoff to your existing training code is a known quantity, not a custom parser. You can drive the same flow from the SDK and use webhooks to kick off a training run automatically the moment an export completes.
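
As a rough illustration of the webhook-triggered flow, the sketch below shows the receiving side in Python. The event name "export.completed" and the "type" and "export_uid" fields are hypothetical placeholders, not Avala's documented webhook schema, and start_training stands in for whatever launches your training job.

```python
import json
from typing import Callable

# Hypothetical event name; replace with the value from the real webhook schema.
EXPORT_COMPLETED = "export.completed"


def handle_webhook(body: bytes, start_training: Callable[[str], None]) -> bool:
    """Trigger training when a webhook body announces a finished export.

    Returns True if a training run was started, False otherwise.
    """
    event = json.loads(body)

    # Ignore every event that is not an export completion.
    if event.get("type") != EXPORT_COMPLETED:
        return False

    # "export_uid" is an assumed field name for the export identifier.
    export_uid = event.get("export_uid")
    if not export_uid:
        return False  # malformed payload: nothing to train on

    start_training(export_uid)
    return True
```

Keeping the handler a plain function that takes the raw body and a callback means it can sit behind any web framework, and it can be tested without a network connection.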

From there, the framework guides show the rest of the path — loading an export into a PyTorch Dataset, or converting it into a Hugging Face datasets.Dataset and pushing it to your hub. The principle is the same regardless of framework: your model trains on verified ground truth, with a dataset version you can reproduce.
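
To make the PyTorch path concrete, here is a minimal sketch of a map-style dataset over a COCO annotation file, written with the standard library only. It assumes the export contains a standard COCO JSON file with "images" and "annotations" arrays. The class name CocoExport and the returned dictionary layout are illustrative choices, not part of Avala's SDK, and image decoding is left out.

```python
import json
from collections import defaultdict
from pathlib import Path


class CocoExport:
    """Map-style dataset over a COCO annotation file.

    It implements __len__ and __getitem__, the protocol that
    map-style datasets in PyTorch follow.
    """

    def __init__(self, annotation_file):
        coco = json.loads(Path(annotation_file).read_text())
        # Sort by image id so that index order is stable across runs.
        self.images = sorted(coco["images"], key=lambda im: im["id"])
        # Group annotations by the image they belong to.
        self.by_image = defaultdict(list)
        for ann in coco["annotations"]:
            self.by_image[ann["image_id"]].append(ann)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        anns = self.by_image.get(image["id"], [])
        return {
            "file_name": image["file_name"],
            # COCO boxes are [x, y, width, height] in pixels.
            "boxes": [a["bbox"] for a in anns],
            "labels": [a["category_id"] for a in anns],
        }
```

An object like this can be handed to torch.utils.data.DataLoader, though images carry different numbers of boxes, so a custom collate_fn is usually needed. A real loader would also read and transform the image pixels in __getitem__.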

Train on ground truth, not raw data

This is the difference that compounds. A pipeline that streams raw recordings to a GPU is fast, but it's training on unlabeled bytes — the labels still have to come from somewhere, and if they're wrong, the model learns the error at scale. Because Avala's datasets are already verified 4D ground truth, the data going into training is the data you can defend.

It also means your dataset mixes are deliberate. The slice you curated in the last lesson — the exact scenarios and edge cases you chose — is what the model sees. When results come back, you can attribute them to specific, versioned data.
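
One way to keep that attribution on the training side is to record a fingerprint of the exact export file next to each run. The sketch below is not Avala's versioning mechanism. It is a generic pattern built on a SHA-256 hash, and the function names and manifest fields are illustrative.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_run_manifest(export_path, manifest_path, **run_info):
    """Write a JSON manifest tying a training run to one export file."""
    manifest = {
        "export_file": Path(export_path).name,
        "export_sha256": fingerprint(export_path),
        **run_info,  # e.g. model name or seed, supplied by the caller
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Two runs whose manifests carry the same export_sha256 used byte-identical data, so a change in results between them has to come from somewhere else.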

Where this is heading

Exports are the bridge today. The direction is to remove the bridge entirely: streaming verified datasets straight from Avala into training without a static export step, and first-class interoperability with the open robot-learning ecosystem so an Avala-curated dataset is training-ready by default. The closed loop is most valuable when the distance from "verified label" to "training batch" is as short as possible.

The payoff

You've now taken one recording from raw multi-sensor capture all the way to a trained model — through a 4D scene, verified ground truth, and a curated, reproducible dataset. One platform, one loop, one API.

But the loop isn't finished when the model ships. The most valuable data is the data the deployed model gets wrong — and capturing it is what makes the next turn of the flywheel better than the last.

Next: Deploy & close the loop →
