This guide shows how to train PyTorch models on Avala data. The recommended path streams annotated frames directly from Avala into a DataLoader: no exports.create call, no archive download, no ETL. A static-export path is also documented for when you want a frozen snapshot.

Prerequisites

pip install "avala[torch]"
This installs the SDK plus the torch extra (PyTorch + Pillow).

avala.torch provides two datasets, the streaming AvalaIterableDataset and the map-style AvalaDataset, that page lazily over your dataset’s items. Each item already carries its presigned media URL and inline annotations, so a training run reads exactly the frames it needs without building an export.
from avala import Client
from avala.torch import AvalaIterableDataset
from torch.utils.data import DataLoader

client = Client()  # reads AVALA_API_KEY

# owner = your org slug, slug = the dataset slug (see client.datasets.list())
dataset = AvalaIterableDataset(
    client,
    "my-org",
    "my-dataset",
    decode_images=True,  # fetch + decode each frame to a PIL image under sample["image"]
)

# Samples are dicts (with a PIL image when decode_images=True), which the default
# collate can't batch — pass a collate_fn, or a transform that returns tensors (below).
loader = DataLoader(dataset, batch_size=8, num_workers=4, collate_fn=lambda batch: batch)

for batch in loader:
    for sample in batch:
        image, annotations = sample["image"], sample["annotations"]
        ...
Each sample is a dict:
| Key | Description |
| --- | --- |
| uid | Item UID |
| key | Item key / filename |
| url | Presigned media URL |
| annotations | Inline annotation payload |
| export_snippet | The same data the export archive is built from |
| metadata | Arbitrary item metadata |
| image | Decoded PIL.Image (only when decode_images=True) |
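
The loader above passes a lambda as collate_fn. When num_workers > 0 on a platform that starts workers with spawn (the default on Windows and macOS), PyTorch has to pickle the collate function, and lambdas cannot be pickled. A minimal sketch of a module-level alternative follows. collate_samples is a hypothetical helper name, not part of avala.torch, and the sample values are placeholders that only mirror some of the keys listed above.

```python
def collate_samples(batch):
    """Keep a batch as a plain list of sample dicts (no tensor stacking)."""
    return list(batch)


# Placeholder samples carrying a subset of the documented keys.
samples = [
    {"uid": "item-1", "key": "frame_0001.jpg", "annotations": {"objects": []}},
    {"uid": "item-2", "key": "frame_0002.jpg", "annotations": None},
]
batch = collate_samples(samples)
uids = [sample["uid"] for sample in batch]
```

It would be passed as DataLoader(dataset, batch_size=8, num_workers=4, collate_fn=collate_samples), which behaves like the lambda but stays picklable.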

Apply a transform

Pass a transform callable to shape each sample into model-ready tensors:
import torchvision.transforms as T

to_tensor = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


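def to_example(sample):
    image = to_tensor(sample["image"])
    boxes, labels = [], []
    # Boxes are passed through as stored in the annotation. T.Resize above does
    # not adjust them, so rescale here if your annotations use pixel coordinates.
    for obj in (sample["annotations"] or {}).get("objects", []):
        if obj.get("type") == "bounding_box":
            c = obj["coordinates"]
            # Convert x/y/width/height to [x_min, y_min, x_max, y_max].
            boxes.append([c["x"], c["y"], c["x"] + c["width"], c["y"] + c["height"]])
            labels.append(obj["label"])
    return image, {"boxes": boxes, "labels": labels}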


dataset = AvalaIterableDataset(client, "my-org", "my-dataset", decode_images=True, transform=to_example)
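
to_example returns string labels and a different number of boxes per image, so PyTorch's default collate cannot batch these examples whenever box counts differ. One common workaround in detection pipelines is to transpose the batch into a tuple of images and a tuple of targets. The sketch below assumes that approach. collate_detection, encode_labels, and the LABEL_TO_ID vocabulary are hypothetical names with made-up classes, and plain strings stand in for image tensors.

```python
# Hypothetical label vocabulary; replace with the classes in your dataset.
LABEL_TO_ID = {"car": 0, "pedestrian": 1}


def encode_labels(target, label_to_id=LABEL_TO_ID):
    """Return a copy of the target with string labels mapped to integer ids."""
    return {
        "boxes": target["boxes"],
        "labels": [label_to_id[name] for name in target["labels"]],
    }


def collate_detection(batch):
    """Transpose [(image, target), ...] into (images, targets) tuples."""
    images, targets = zip(*batch)
    return images, tuple(encode_labels(t) for t in targets)


# Two examples with different box counts; strings stand in for image tensors.
batch = [
    ("image-0", {"boxes": [[0, 0, 10, 10], [5, 5, 20, 20]], "labels": ["car", "pedestrian"]}),
    ("image-1", {"boxes": [[1, 2, 3, 4]], "labels": ["car"]}),
]
images, targets = collate_detection(batch)
```

Passing collate_fn=collate_detection to the DataLoader keeps each target separate. Because every image is resized to 224x224, the image tuple can still be combined with torch.stack in the training loop.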

Multi-worker and Distributed Data Parallel

AvalaIterableDataset shards automatically:
  • DataLoader workers: each of the num_workers workers reads a disjoint slice, so no sample is duplicated.
  • DDP: rank and world size are read from torch.distributed when it is initialized, so every replica streams its own shard. Pass rank= / world_size= explicitly if you manage distribution yourself.
# Under torchrun / DDP, this "just works" — each rank gets a disjoint shard:
dataset = AvalaIterableDataset(client, "my-org", "my-dataset", decode_images=True)
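# decode_images=True yields PIL images, so a collate_fn (or a tensor transform) is still required.
loader = DataLoader(dataset, batch_size=8, num_workers=4, collate_fn=lambda batch: batch)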
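
How AvalaIterableDataset shards internally is not described here. As an illustration only, the sketch below shows one common scheme, strided slicing, in which each (rank, worker) pair takes every N-th item of the stream. The shard function and its parameters are hypothetical and are not taken from the SDK.

```python
from itertools import islice


def shard(items, rank, world_size, worker_id, num_workers):
    """Return an iterator over the strided slice owned by one (rank, worker) pair."""
    shard_index = rank * num_workers + worker_id   # position among all readers
    shard_count = world_size * num_workers         # total number of readers
    return islice(items, shard_index, None, shard_count)


# 10 items split across 2 DDP ranks with 2 DataLoader workers each.
items = range(10)
shards = {
    (rank, worker): list(shard(items, rank, 2, worker, 2))
    for rank in range(2)
    for worker in range(2)
}
```

In a hand-written IterableDataset, worker_id and num_workers would typically come from torch.utils.data.get_worker_info(), and rank and world size from torch.distributed.get_rank() and torch.distributed.get_world_size().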

Need shuffling or indexed access?

Use the map-style AvalaDataset. It materializes the item list up front (one cursor walk) so it supports len() and random access, then fetches each item on access:
from avala.torch import AvalaDataset

dataset = AvalaDataset(client, "my-org", "my-dataset", decode_images=True)
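# Samples still contain PIL images here, so keep a collate_fn (or pass a tensor transform).
loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=lambda batch: batch)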
print(f"Training on {len(dataset)} samples")
For large datasets, prefer AvalaIterableDataset for streaming throughput.
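
Because the map-style dataset supports len() and indexed access, it can also be split into training and validation subsets by index. A minimal sketch, assuming a simple seeded shuffle is acceptable for your split; split_indices is a hypothetical helper, not part of avala.torch.

```python
import random


def split_indices(n_items, val_fraction=0.2, seed=0):
    """Shuffle range(n_items) with a fixed seed and return (train, val) index lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_val = int(n_items * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(10, val_fraction=0.2, seed=0)
```

With a real dataset you would call split_indices(len(dataset)) and wrap each index list with torch.utils.data.Subset(dataset, train_idx) before building the loaders.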

Static export (alternative)

When you want a reproducible, frozen snapshot (e.g. to archive a training set), create an export and load it from disk:
import json
import requests
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset, DataLoader

from avala import Client

client = Client()

# 1. Build + download an export
export = client.exports.create(project="proj_abc123")
export = client.exports.wait(export.uid)  # blocks until ready
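response = requests.get(export.download_url, timeout=60)
response.raise_for_status()  # fail loudly instead of saving an error page as export.json
Path("export.json").write_bytes(response.content)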


class AvalaExportDataset(Dataset):
    """PyTorch dataset backed by a downloaded Avala JSON export."""

    def __init__(self, export_path: str, images_dir: str, transform):
        self.annotations = json.loads(Path(export_path).read_text())
        self.images_dir = Path(images_dir)
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        ann = self.annotations[idx]
        image = self.transform(Image.open(self.images_dir / ann["file_name"]).convert("RGB"))
        return image, ann.get("annotations", [])
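
Before training from a snapshot, it can help to confirm that every record in the export has a matching image on disk. The sketch below makes the same assumption as the class above, namely that the export is a JSON list of records with a "file_name" field, and further assumes the images were downloaded separately into images_dir. find_missing_images is a hypothetical helper and the demo files are throwaway placeholders.

```python
import json
import tempfile
from pathlib import Path


def find_missing_images(export_path, images_dir):
    """Return file_name values from the export that have no file in images_dir."""
    records = json.loads(Path(export_path).read_text())
    images_dir = Path(images_dir)
    return [r["file_name"] for r in records if not (images_dir / r["file_name"]).is_file()]


# Demo on a temporary directory with one present and one missing image.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "images").mkdir()
    (tmp_path / "images" / "a.jpg").write_bytes(b"")
    export_file = tmp_path / "export.json"
    export_file.write_text(json.dumps([
        {"file_name": "a.jpg", "annotations": []},
        {"file_name": "b.jpg", "annotations": []},
    ]))
    missing = find_missing_images(export_file, tmp_path / "images")
```

An empty result means AvalaExportDataset can open every referenced image. Otherwise the returned names show which files to fetch before starting a run.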

Next Steps