From raw video to machine-readable reality.

The Problem

More video is not more insight. As AI moves from the digital screen to the physical world, we have hit a data wall: we are drowning in raw pixels but still lack fundamental, structured truth. To build high-level world models and teach machines to navigate our world, we must move beyond the limits of current data collection.

  • Raw Video is Unstructured Noise: AI models cannot be trained on raw video; it is essentially just a stream of pixels, lacking the granular, machine-readable structure AI needs to "understand" physical interactions.
  • The Simulation Paradox: Simulated environments are bounded by their creators. They consistently fail to capture the entropy of the real world, yet it is precisely these messy, unpredictable edge cases that truly intelligent world models must account for.
  • Scaling Bottleneck: Human-in-the-loop labeling cannot keep pace with the exponential data demands of physical AI. It is too slow, too expensive, and physically impossible to scale to the petabytes required for world-class models.

Our Solution

DataLabs delivers Data-as-a-Service. We curate massive publicly available video datasets, such as Creative Commons-licensed YouTube content (approximately 49 million videos), run our AI pipeline to parse physical interactions and structure every frame, and output rich schemas including object tracking, actions, causal links, depth, and embeddings. One pipeline from public video to training-ready data.

Video Acquisition

We source millions of public videos within our legal framework, spanning diverse real-world settings and use cases tailored to the needs and interests of our client labs.

AI Pipeline

Our pipeline parses each video into objects, actions, causal relationships, and temporal structure: the fundamental, training-ready structured data.

Schema Generation

Every frame becomes a machine-readable schema: visual, audio, physical, semantic. Labs then use these schemas to teach models how the real world behaves.
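
As a rough sketch, these layers nest into a single per-video record (the top-level keys shown here are illustrative and vary by client configuration; detailed fragments follow below):

    // illustrative top-level layout; exact keys depend on the client configuration
    {
      "video_id": "skill.video_001",
      "admin_and_legal": {},
      "technical_integrity": {},
      "events": [],
      "causal_graph": { "nodes": [], "edges": [] },
      "dominant_objects": [],
      "segments": [],
      "transcript": [],
      "embeddings": {}
    }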

The AI Pipeline

Our pipeline turns raw video into training-ready datasets at scale. Every run produces a full stack of structured signals, from legal and technical metadata through to pre-computed depth, causal graphs, and embeddings. Each dataset is designed for research and production use, with consistent schemas and provenance so you can train, evaluate, and deploy with confidence. We also customize the pipeline and the output schemas to the specific requirements of each client lab. Our datasets include:

  • Who did what to whom, when, and with what result. Each event is an agent–action–object–target tuple with timestamps, so you can train on physical interactions instead of guessing from pixels. Useful for affordance learning, action recognition, and causal attribution.

    {
      "event_id": "act_001",
      "agent": "human_right_hand",
      "action": "thermal_bridge_application",
      "object": "obj_iron_01",
      "target": "obj_pcb_01",
      "start_ms": 190400,
      "end_ms": 198200
    }
  • Lens and camera pose for every frame. You get focal length, principal point, and distortion so models can recover 3D geometry and fuse multiple views without guessing. Handy for metric reconstruction and any task that needs real scale.

    "intrinsics": {
      "model": "pinhole",
      "params": { "fx": 1240.5, "fy": 1240.5,
                 "cx": 960, "cy": 540 }
    }
  • Events and states as nodes, cause–effect links as edges. The pipeline outputs a DAG so you can train models to answer “what if?” and plan, not just recognize. Edges can encode necessity or sufficiency, so you know what must happen before what.

    "nodes": [{ "id": "N1", "type": "action",
      "label": "Flux Application" }],
    "edges": [{ "from": "N1", "to": "N3",
      "relation": "precondition" }]
  • Step-by-step reasoning and what-if questions tied to the video. We annotate why things happen and what would change if something else had happened, so models learn to explain and handle edge cases. Good for interpretability and robust decision-making.

    {
      "type": "counterfactual",
      "question": "What if solder touched the iron first?",
      "answer": "Would tin the tip but fail to wet the pad..."
    }
  • One place for license, checksums, resolution, and scene context. You can verify provenance, rerun experiments, and filter by legal or technical criteria without digging through raw files. Everything you need for compliance and reproducibility is in one manifest.

    "admin_and_legal": {
      "video_id": "skill.video_001",
      "license_type": "CC-BY-4.0-Commercial"
    },
    "technical_integrity": {
      "temporal": { "fps": 60 },
      "spatial": { "resolution": [1920, 1080] }
    }
  • Metric depth (how far each pixel is) and optical flow (how it moves frame to frame), with confidence where needed. Real-world geometry and motion, no sim or hand labels. Depth is in meters; flow gives you dense motion for dynamics and temporal consistency.

    // depth per pixel: [meters, confidence]
    [[0.45, 0.98], [0.12, 0.92]]
    
    // flow per frame
    { "mean_magnitude": 1.25, "flow_entropy": 0.45 }
  • Segmentation and tracking in one format: which pixels belong to which object, and the same ID across time. So models see who is where, when, and how they interact over the clip. Masks and bboxes at keyframes, with instance and semantic labels in a single representation.

    "dominant_objects": [
      { "id": "obj_iron_01", "label": "soldering_iron" },
      { "id": "obj_pcb_01", "label": "pcb", "static": true }
    ],
    "keyframes": [{
      "timestamp_ms": 190400,
      "objects": [{ "id": "obj_iron_01",
        "bbox": [0.42, 0.48, 0.08, 0.12] }]
    }]
  • Videos cut into segments by phase (e.g. setup, execution) with short summaries and milestones. Use it for goal-conditioned training or to benchmark “did the model get the procedure?” Each segment has start/end times and a complexity score so you can sample or weight by difficulty.

    {
      "segment_id": "seg_001",
      "type": "Preparation",
      "task_phase": "Setup",
      "summary": "Workspace organization and tool safety check.",
      "start_time_ms": 0,
      "end_time_ms": 45000
    }
  • Transcripts with timestamps, plus links from spoken words (e.g. “the tip”) to object IDs in the scene. Language and vision share the same vocabulary so you can train language-conditioned or audio–visual models. Optional acoustic events (clicks, clangs) are tagged where relevant.

    { "utterance_id": "u_001",
      "text": "Place the tip so it touches the pad and lead.",
      "start_ms": 122000, "end_ms": 126500 },
    "entities": [
      { "text": "tip", "object_id": "ID_001" },
      { "text": "pad", "object_id": "ID_002" }
    ]
  • Precomputed visual and audio vectors (e.g. SigLIP, CLAP) so you can skip feature extraction. Same pipeline, same schema—drop them into your trainer or retrieval stack and go. Visual is time-windowed (e.g. per second); audio is one vector per clip unless you ask otherwise.
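
    A representative fragment (model names as above; field names are illustrative and depend on the client configuration):

    // example embedding block; vectors truncated for display
    "embeddings": {
      "visual": [{ "window_start_ms": 0, "window_end_ms": 1000,
        "model": "siglip", "vector": [0.013, -0.087, 0.142] }],
      "audio": { "model": "clap",
        "vector": [0.034, -0.012, 0.098] }
    }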

Why DataLabs

Training-Ready Data

Get structured schemas and annotations instead of raw video. Focus on training and evaluation instead of building preprocessing from scratch.

Depth & Physics

Per-frame depth, optical flow, physics, and camera intrinsics from real video. The kind of geometric signal simulators struggle to reproduce.

Singular Pipeline

One automated pipeline turns public video into training-ready datasets. No stitching together multiple vendors or tools.

Causal Reasoning

Our causal graphs, structured as DAGs, allow you to move beyond pattern matching toward reasoning and planning.