DataLoader — no exports.create, no archive download, no ETL. A static-export path is also documented for when you want a frozen snapshot.
Prerequisites
torch extra (PyTorch + Pillow).
Stream straight from Avala (recommended)
avala.torch provides two datasets that page lazily over your dataset’s items — each item already carries its presigned media URL and inline annotations, so a training run reads exactly the frames it needs without building an export.
| Key | Description |
|---|---|
| uid | Item UID |
| key | Item key / filename |
| url | Presigned media URL |
| annotations | Inline annotation payload |
| export_snippet | The same data the export archive is built from |
| metadata | Arbitrary item metadata |
| image | Decoded PIL.Image (only when decode_images=True) |
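To make the sample shape concrete, here is a minimal sketch of a type for one sample, built from the keys in the table above. The `AvalaSample` name, the `describe` helper, and all field types are assumptions for illustration; the SDK may type these differently.

```python
from typing import Any, TypedDict


class AvalaSample(TypedDict, total=False):
    """Hypothetical type for one sample; the field types are assumptions."""

    uid: str
    key: str
    url: str
    annotations: Any
    export_snippet: Any
    metadata: dict
    image: Any  # a PIL.Image.Image, present only when decode_images=True


def describe(sample: AvalaSample) -> str:
    """Summarize a sample without assuming the image was decoded."""
    decoded = "decoded" if "image" in sample else "not decoded"
    return f"{sample['key']} ({sample['uid']}): image {decoded}"


# Placeholder values, not real Avala data.
example: AvalaSample = {
    "uid": "item-123",
    "key": "frame_0001.jpg",
    "url": "https://example.com/frame_0001.jpg",
    "annotations": [],
    "metadata": {},
}
```

Checking for the `image` key before using it keeps downstream code correct whether or not decoding is enabled.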
Apply a transform
Pass a transform callable to shape each sample into model-ready tensors.
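A transform can be sketched as a plain function from one sample dict to the structure your model expects. The example below assumes that annotations is a list of dicts with a "label" field; that layout, the `CLASS_TO_INDEX` vocabulary, and the `to_training_sample` name are all hypothetical, so adapt them to your actual annotation payload. It stays in pure Python so it runs without PyTorch installed.

```python
from typing import Any, Dict, List

# Hypothetical label vocabulary; replace with your project's classes.
CLASS_TO_INDEX = {"car": 0, "pedestrian": 1}


def to_training_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Keep the image and map known annotation labels to integer class ids."""
    # Assumption: each annotation is a dict with a "label" key.
    annotations = sample.get("annotations") or []
    labels: List[int] = [
        CLASS_TO_INDEX[annotation["label"]]
        for annotation in annotations
        if annotation.get("label") in CLASS_TO_INDEX
    ]
    return {
        "uid": sample["uid"],
        "image": sample.get("image"),  # None unless decode_images=True
        "labels": labels,
    }
```

You would then pass `transform=to_training_sample` when constructing the dataset. In a real training run the function would typically also convert the image and labels to tensors, for example with `torch.tensor(labels)`.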
Multi-worker and Distributed Data Parallel
AvalaIterableDataset shards automatically:
- DataLoader workers: each num_workers worker reads a disjoint slice; no duplicates.
- DDP: rank/world size are read from torch.distributed when initialized, so every replica streams its own shard. Override explicitly with rank=/world_size= if you manage distribution yourself.
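To illustrate how disjoint shards across replicas and workers can be derived, here is a sketch of one common round-robin scheme. It is not necessarily how AvalaIterableDataset partitions items internally, and the `shard` helper is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard(
    items: Iterable[T],
    rank: int,
    world_size: int,
    worker_id: int,
    num_workers: int,
) -> Iterator[T]:
    """Yield every Nth item, where N is the total number of readers."""
    total_readers = world_size * num_workers
    # Give each (rank, worker) pair a unique index in [0, total_readers).
    reader_index = rank * num_workers + worker_id
    return islice(items, reader_index, None, total_readers)


# 2 replicas x 2 workers over 8 items: every item is read exactly once.
items = list(range(8))
shards = [
    list(shard(items, rank, 2, worker_id, 2))
    for rank in range(2)
    for worker_id in range(2)
]
```

In PyTorch code, the worker id and worker count come from `torch.utils.data.get_worker_info()`, and the rank and world size from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()`.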
Need shuffling or indexed access?
Use the map-style AvalaDataset. It materializes the item list up front (one cursor walk) so it supports len() and random access, then fetches each item on access.
Prefer AvalaIterableDataset for streaming throughput.
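The "materialize up front, then index" pattern behind a map-style dataset can be sketched as follows. The `MaterializedItems` class, the `FetchPage` signature, and the fake two-page backend are hypothetical stand-ins, not the SDK's implementation. Unlike AvalaDataset, this sketch returns the stored item directly and does not fetch media on access.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical page fetcher: takes a cursor, returns (items, next_cursor).
FetchPage = Callable[[Optional[str]], Tuple[List[Dict[str, Any]], Optional[str]]]


class MaterializedItems:
    """Map-style container: one cursor walk up front, then random access."""

    def __init__(self, fetch_page: FetchPage) -> None:
        self._items: List[Dict[str, Any]] = []
        cursor: Optional[str] = None
        while True:
            page, cursor = fetch_page(cursor)
            self._items.extend(page)
            if cursor is None:  # no further pages
                break

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        return self._items[index]


# Fake two-page backend standing in for the real API.
PAGES = {
    None: ([{"uid": "a"}, {"uid": "b"}], "p2"),
    "p2": ([{"uid": "c"}], None),
}
dataset = MaterializedItems(lambda cursor: PAGES[cursor])
```

Because the container defines both `__len__` and `__getitem__`, a PyTorch DataLoader can sample it in random order with `shuffle=True`, which an iterable-style dataset does not support.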
Static export (alternative)
When you want a reproducible, frozen snapshot (e.g. to archive a training set), create an export and load it from disk.
Next Steps
- Python SDK reference for all available methods
- Export API for export format details
- Avala + Hugging Face for Hugging Face Datasets integration