
Avala provides multiple ways to ingest data depending on your dataset size, infrastructure, and automation needs. This page covers each import method, when to use it, and how to build automated data pipelines.

Import Methods Overview

| Method | Best For | Max Size | Automation | Setup |
|---|---|---|---|---|
| Mission Control upload | Small datasets, one-off imports | 10 GB per user | Manual | None |
| Presigned URL upload | Programmatic uploads from any language | 10 GB per user | Full | API key |
| Cloud storage (S3/GCS) | Large datasets, zero-copy access | Unlimited | Full | Bucket config |
| MCAP import | Multi-sensor robotics data | 10 GB per file | Full | API key |
| SDK bulk upload | Medium datasets with progress tracking | 10 GB per user | Full | SDK installed |

Mission Control Upload

The simplest way to get data into Avala. Drag and drop files directly in the web interface.

Steps

  1. Go to Mission Control > Datasets > Create Dataset
  2. Name your dataset and select the data type
  3. Drag files into the upload area or click Browse
  4. Wait for processing to complete

Limitations

  • Browser-based upload is limited by your connection speed and browser memory
  • Not suitable for datasets with more than 1,000 files
  • No resumable uploads — interrupted uploads must restart

For datasets larger than a few hundred files, use the SDK or presigned URL approach instead.

Presigned URL Upload

Presigned URLs let you upload files directly to Avala’s storage from any HTTP client. This is the most flexible programmatic upload method and works from any language or tool that can make HTTP requests.

How It Works

  1. Request a presigned upload URL from the Avala API
  2. Upload your file directly to the presigned POST URL
  3. Create the dataset from the uploaded files

Example: Upload with cURL

# Step 1: Get a presigned upload URL
curl -X POST https://api.avala.ai/api/v1/datasets/manual-upload/file-upload-url/ \
  -H "X-Avala-Api-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_name": "robot-run-001",
    "file_path_in_dataset": "frame_001.jpg",
    "content_length": 1024000
  }'

# Response:
# { "method": "POST", "url": "https://s3.amazonaws.com/...", "fields": { ... } }

# Step 2: Upload the file with the returned POST fields
curl -X POST "https://s3.amazonaws.com/..." \
  -F "key=..." \
  -F "policy=..." \
  -F "file=@frame_001.jpg"

# Step 3: Create the dataset from uploaded files
curl -X POST https://api.avala.ai/api/v1/datasets/manual-upload/ \
  -H "X-Avala-Api-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "robot-run-001",
    "slug": "robot-run-001",
    "data_type": "image",
    "visibility": "private",
    "industry": 123,
    "license": 456
  }'

Example: Upload with the CLI

# Handles presigned URLs, direct storage upload, and dataset creation.
# Limit: 10 GiB total per local-upload dataset.
avala datasets upload \
  --source data/images \
  --name robot-run-001 \
  --slug robot-run-001 \
  --data-type image \
  --industry 123 \
  --license 456
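The same three-step flow can be scripted in Python. The sketch below assumes the `requests` library is available; the endpoint path, header name, and request fields come from the cURL example above, while the helper names are our own and error handling is minimal.

```python
# upload_presigned.py - the three-step cURL flow above, sketched in Python.
API_BASE = "https://api.avala.ai/api/v1"

def build_url_request(dataset_name: str, path_in_dataset: str, size: int) -> dict:
    """Body for the presigned-URL request (fields from the cURL example)."""
    return {
        "dataset_name": dataset_name,
        "file_path_in_dataset": path_in_dataset,
        "content_length": size,
    }

def upload_file(api_key: str, dataset_name: str,
                local_path: str, path_in_dataset: str) -> None:
    import requests  # imported here so the payload helper stays dependency-free

    with open(local_path, "rb") as f:
        data = f.read()

    # Step 1: request a presigned POST URL
    resp = requests.post(
        f"{API_BASE}/datasets/manual-upload/file-upload-url/",
        headers={"X-Avala-Api-Key": api_key},
        json=build_url_request(dataset_name, path_in_dataset, len(data)),
    )
    resp.raise_for_status()
    presigned = resp.json()

    # Step 2: POST the file to storage with the returned form fields
    upload = requests.post(
        presigned["url"],
        data=presigned["fields"],
        files={"file": (path_in_dataset, data)},
    )
    upload.raise_for_status()
```

Step 3, creating the dataset, is the same JSON POST shown at the end of the cURL example and is omitted here.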

Cloud Storage Integration

For large-scale datasets, connect your own S3 or GCS bucket so Avala reads data directly from your storage — no file transfers, no copies.

When to Use Cloud Storage

| Scenario | Use Cloud Storage? |
|---|---|
| Dataset > 10,000 items | Yes |
| Dataset > 100 GB total | Yes |
| Data must stay in your infrastructure | Yes |
| Quick prototype with < 100 items | No — direct upload is faster |
| Data is spread across multiple buckets | Yes — connect multiple storage configs |

Setup

  1. Configure your bucket with the appropriate IAM policy (see Cloud Storage guide)
  2. Add the storage configuration in Mission Control > Settings > Storage
  3. Create a dataset and select your connected storage as the data source
  4. Reference items by their storage paths

Example: Create Dataset from S3

from avala import Client

client = Client()

# Create a dataset backed by cloud storage
dataset = client.datasets.create(
    name="driving-data-2026-02",
    data_type="image",
    storage_config_uid="stg_your_config_uid"
)

# Register items by their S3 paths
items = [
    {"path": "s3://your-bucket/captures/frame_001.jpg"},
    {"path": "s3://your-bucket/captures/frame_002.jpg"},
    {"path": "s3://your-bucket/captures/frame_003.jpg"},
]

for item in items:
    client.datasets.create_item(
        dataset_uid=dataset.uid,
        source_url=item["path"]
    )

Cloud storage datasets load faster in the annotation editor because images are served directly from your bucket’s region, avoiding cross-region transfers.

MCAP Import

MCAP files contain synchronized multi-sensor data (cameras, LiDAR, IMU). Avala parses MCAP files to extract and align sensor streams for annotation.

Supported Message Types

| Message Type | Description |
|---|---|
| sensor_msgs/Image | Camera images |
| sensor_msgs/CompressedImage | Compressed camera images |
| sensor_msgs/PointCloud2 | LiDAR point clouds |
| sensor_msgs/Imu | IMU readings |
| geometry_msgs/TransformStamped | Sensor transforms (TF) |
| sensor_msgs/NavSatFix | GPS coordinates |

Import Workflow

  1. Upload MCAP files via the SDK or presigned URLs
  2. Avala processes the file, extracting camera frames and point cloud scans
  3. Sensor streams are synchronized by timestamp
  4. Camera images and projected LiDAR data appear together in the annotation editor

For detailed MCAP setup, see the MCAP / ROS integration guide.
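Before uploading a large recording, it can help to check which of its channels Avala will actually extract. The helper below is illustrative, not part of the Avala SDK; it partitions a `{topic: message_type}` mapping against the supported-message-type table above.

```python
# check_mcap_topics.py - pre-flight check (illustrative, not an Avala API):
# split a recording's channels into types Avala extracts vs. ones it skips.

# Message types from the "Supported Message Types" table
SUPPORTED_TYPES = {
    "sensor_msgs/Image",
    "sensor_msgs/CompressedImage",
    "sensor_msgs/PointCloud2",
    "sensor_msgs/Imu",
    "geometry_msgs/TransformStamped",
    "sensor_msgs/NavSatFix",
}

def partition_channels(channels: dict) -> tuple:
    """Split {topic: message_type} into (supported, unsupported) dicts."""
    supported = {t: m for t, m in channels.items() if m in SUPPORTED_TYPES}
    unsupported = {t: m for t, m in channels.items() if m not in SUPPORTED_TYPES}
    return supported, unsupported
```

For example, a recording with `/camera/front` (`sensor_msgs/CompressedImage`) and `/diagnostics` (`diagnostic_msgs/DiagnosticArray`) would have the camera stream extracted and the diagnostics channel skipped.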

Building Import Pipelines

For production workflows, automate data ingestion so new data flows into Avala as it is collected.

Pipeline Architecture

Data Source                  Avala
┌──────────────┐            ┌──────────────────┐
│ Collection   │            │ Dataset          │
│ System       │──upload──→ │ (items created)  │
│ (cameras,    │            │                  │
│  sensors)    │            │ Project          │
└──────────────┘            │ (tasks assigned) │
                            └────────┬─────────┘
                                     │
                                     │ webhook
                                     ▼
                            ┌──────────────────┐
                            │ Your Pipeline    │
                            │ (export, train)  │
                            └──────────────────┘

Example: Automated Ingestion with Webhooks

Combine the CLI upload with webhooks to build a fully automated pipeline:

# upload_pipeline.py
import os
import subprocess
from datetime import datetime, timezone

INDUSTRY_ID = os.environ["AVALA_INDUSTRY_ID"]
LICENSE_ID = os.environ["AVALA_LICENSE_ID"]

def ingest_batch(data_directory: str) -> str:
    """Upload a directory snapshot and create a new Avala dataset."""
    batch_name = f"camera-batch-{datetime.now(timezone.utc):%Y%m%d-%H%M%S}"

    subprocess.run(
        [
            "avala", "datasets", "upload",
            "--source", data_directory,
            "--name", batch_name,
            "--slug", batch_name,
            "--data-type", "image",
            "--industry", INDUSTRY_ID,
            "--license", LICENSE_ID,
        ],
        check=True,
    )
    return batch_name

if __name__ == "__main__":
    dataset_name = ingest_batch("/data/incoming")
    print(f"Created dataset {dataset_name}")

Schedule this script with cron, Airflow, or any task scheduler to periodically ingest new data.
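As one scheduling option, a crontab entry can run the script hourly. The paths and log location below are placeholders for your own environment; the ID values mirror the examples used throughout this page.

```shell
# crontab entry (illustrative paths): run the ingestion script at the top of every hour
0 * * * * AVALA_INDUSTRY_ID=123 AVALA_LICENSE_ID=456 /usr/bin/python3 /opt/pipelines/upload_pipeline.py >> /var/log/avala_ingest.log 2>&1
```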

Example: Watch Directory and Upload

#!/bin/bash
# watch_and_upload.sh - Upload new files as they appear

WATCH_DIR="/data/incoming"
INDUSTRY_ID="123"
LICENSE_ID="456"

inotifywait -m -e create "$WATCH_DIR" --format '%f' | while read -r filename; do
    if [[ "$filename" == *.jpg || "$filename" == *.png ]]; then
        dataset_name="camera-file-$(date +%Y%m%d-%H%M%S)"
        avala datasets upload \
          --source "$WATCH_DIR/$filename" \
          --name "$dataset_name" \
          --slug "$dataset_name" \
          --data-type image \
          --industry "$INDUSTRY_ID" \
          --license "$LICENSE_ID"
        echo "Created dataset from: $filename"
    fi
done
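The "webhook" step in the pipeline diagram needs a receiver on your side. The sketch below is a minimal stdlib-only listener; the event name and payload fields (`event`, `dataset_uid`) are placeholders, so consult the Webhooks guide for the actual schema before relying on them.

```python
# webhook_listener.py - minimal receiver for the webhook step in the
# pipeline diagram, using only the Python standard library.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(payload: dict) -> str:
    """Route an event to a pipeline action; returns the action taken."""
    event = payload.get("event")
    if event == "dataset.completed":  # hypothetical event name
        return f"export {payload.get('dataset_uid')}"
    return "ignored"

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body and dispatch it to the handler above
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        action = handle_event(payload)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(action.encode())

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```

Keeping the routing logic in a plain function (`handle_event`) separate from the HTTP plumbing makes it easy to unit-test without starting a server.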

Choosing an Import Method

Use this decision tree to select the right approach:

| Question | If Yes | If No |
|---|---|---|
| Fewer than 100 files? | Mission Control upload | Continue |
| Data already in S3/GCS? | Cloud storage integration | Continue |
| MCAP or ROS bag files? | MCAP import | Continue |
| Need automation? | SDK bulk upload or presigned URLs | Mission Control upload |
| Using Python or TypeScript? | SDK bulk upload | Presigned URL (any language) |

Next Steps

Cloud Storage

Detailed S3 and GCS configuration for bring-your-own-storage.

MCAP / ROS

Import multi-sensor recordings with camera, LiDAR, and IMU data.

Python SDK

Install the Python SDK and start uploading data programmatically.

Webhooks

Set up event notifications to trigger downstream pipelines.