The robot moved confidently in the wrong direction.
Not random noise — that would have been easier to diagnose. The motion had geometric coherence. The arm was doing something, just not the something we had demonstrated. We had converted synthetic demonstration data from a sim-based SDG pipeline into LeRobot format for VLA fine-tuning. The script ran clean. Training started. Loss looked reasonable. And then the dry run told us something had gone deeply wrong upstream.
That moment forced a question we had not asked clearly enough: did our converted dataset actually mean what we thought it meant?
It had loaded fine. It had trained fine. Neither of those things, it turned out, was a sufficient answer.
Two Layers of Validation
Format conversion in robotics has two distinct validation layers, and most pipelines only check one:
Structural validity — correct folder layout, readable Parquet files, decodable videos, metadata present. This is what your conversion script implicitly checks when it does not crash.
Semantic validity — are actions in the right reference frame? Do they represent positions, velocities, or deltas? Are normalization statistics computed from the actual data distribution? Are episode boundaries correct? Are images in the right channel order?
The ecosystem has tooling for the first. Almost nothing exists for the second. Silent failures live entirely in the second layer — they train without complaint and surface only at deployment.
Sim pipelines were built for IL and RL workflows, not VLA ingestion. Their data contracts — reference frames, action representations, timestamp conventions — do not map cleanly to what a VLA training loop expects. LeRobot trusts you to put semantically correct data in the right fields. A conversion script is not a reformatter. It is a series of semantic claims about what your data means. The training loop accepts those claims without question.
What LeRobot Format Is Actually Promising
Before getting into what breaks, it helps to understand what each part of the format is contracting with the training loop — not as a folder structure tutorial, but as a mental model for why semantic errors are invisible.
The three folders have three distinct jobs:
data/ stores the frame-by-frame demonstration record. Every row is a timestep. Every column name is a semantic claim — when you write a value into action, you are asserting this is what the robot should learn to reproduce. The format does not enforce what that action represents. That is entirely on you.

videos/ stores synchronized camera streams as MP4 files. The training loop does not read raw tensors — it seeks into the video at a specific timestamp per frame. Two things matter beyond “the video exists”: codec (LeRobot expects H.264 via torchvision — AV1 and other sim pipeline defaults will decode incorrectly or fail), and timestamp precision (floating-point accumulation across episodes can cause seeks to invalid positions, a known ecosystem issue that surfaces after approximately 45 episodes in large datasets).

meta/ is where most silent failures originate. Three files, three contracts:

info.json declares the schema, FPS, and path templates. The training loop reads this first. A declared FPS that does not match actual data cadence corrupts every temporal operation — observation history, delta timestamps, action chunking windows.

stats.json stores mean, std, min, max for every feature. This is consumed automatically as dataset.meta.stats to normalize every action and observation before they enter the model. If these statistics were computed on a subset, before clipping, or carried over from a previous dataset version, the model normalizes against a distribution that does not reflect reality. Training proceeds. Loss converges. The model has learned in the wrong scaled space.

episodes/ (v3) or episodes.jsonl (v2.1) maps episode indices to frame offsets. Wrong offsets mean the model silently learns transitions that never happened — the last frame of episode N paired with the first action of episode N+1.
Two distinctions worth internalizing that the official docs underemphasize: timestamp and frame_index are not interchangeable — frame_index is a counter, timestamp is actual wall-clock seconds, and naive converters that increment one without computing the other introduce desynchronization that compounds across long episodes. And stats.json is not cosmetic metadata — it is the normalization contract, consumed automatically, trusted implicitly.
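A minimal sketch of the relationship a correct converter maintains; the FPS and frame count here are illustrative, not taken from any particular pipeline:
import numpy as np
fps = 30                          # must equal the value declared in meta/info.json
frame_index = np.arange(120)      # per-episode counter, resets at each new episode
timestamp = frame_index / fps     # wall-clock seconds, derived from the counter and the real cadence
# A converter that writes timestamp = frame_index, or accumulates a rounded per-frame
# delta instead of computing frame_index / fps, drifts from the declared FPS, and the
# drift compounds over long episodes.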
For the full format specification, the LeRobot documentation covers the structural details. What follows is what the docs do not cover: what breaks silently when the semantic layer is wrong.
The Silent Killers
These are the failure modes that survive structural validation. For each one: what it is and why the training loop does not surface it. The diagnostic checks for all of them are in the next section.
1. Action Representation Mismatch
The action column can hold absolute joint positions, joint velocities, Cartesian end-effector deltas, absolute end-effector poses, or IK-relative commands. These are not interchangeable. The format accepts all of them without complaint.
Sim pipelines compound this. IsaacLab distinguishes between raw actions (what the controller received) and processed_actions (what was actually executed after IK solving). Copying the wrong field into your LeRobot action column is a one-line mistake with no error signal.
Why it is silent. The training loop optimizes numbers, not their physical meaning. The model learns a perfectly consistent — but semantically wrong — policy.
2. Coordinate Frame Confusion
Every action vector exists in a reference frame — world frame, robot base frame, end-effector frame. Sim environments often default to world frame because that is natural for physics simulation. Most VLA models expect base-frame or end-effector-frame actions. LeRobot has no field that declares which frame was used.
This was our earliest failure. Data from the sim was in world frame. The VLA expected base-frame actions. The dataset was structurally perfect. The robot moved confidently in the wrong direction — which is where this post started.
Why it is silent. Frame errors are numerically invisible. Values are valid floats in a valid range. Nothing in the training pipeline checks geometric consistency between action vectors and robot kinematics.
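For intuition, here is a minimal sketch of the transform that was missing in our case: world-frame deltas re-expressed in the robot base frame. The 90-degree base orientation is a made-up example, and scipy is just one convenient way to write the rotation:
import numpy as np
from scipy.spatial.transform import Rotation as R
# Hypothetical setup: the robot base is rotated 90 degrees about z relative to the world frame.
base_in_world = R.from_euler("z", 90, degrees=True)
delta_world = np.array([0.05, 0.0, 0.0])            # "move 5 cm along world x", as the sim logged it
delta_base = base_in_world.inv().apply(delta_world)
print(delta_base)                                    # ~[0, -0.05, 0]: the same motion in base-frame coordinates
# Writing delta_world into the action column when the model expects base-frame deltas
# produces exactly the confidently-wrong-direction behavior this post opened with.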
3. Model-Data Contract Mismatch
This failure is not about what is in your data but about what your target model assumes about how the data will be used — assumptions that live in the model’s architecture, not in any dataset schema.
In our case: we were fine-tuning Pi0, which implements action chunking internally. We were also chunking during conversion — pre-computing multi-step sequences and writing them into the dataset. The result was double-chunked actions. Training ran without error. The policy had learned to execute action sequences at the wrong temporal granularity.
This extends beyond chunking. Different VLA architectures — Pi0, GR00T N1.5, SmolVLA — make different assumptions about action dimensionality, observation history length, and gripper state encoding. None of these are enforced at the dataset level.
Why it is silent. The model does not validate its architectural assumptions against the dataset schema. The mismatch manifests as degraded policy quality, not a training error.
4. Normalization Stat Corruption
stats.json stores mean, std, min, and max for every feature. The training loop consumes these automatically to normalize actions and observations. If the statistics are wrong — computed on a subset, before clipping, or carried over from a previous version — every action the model sees during training is scaled incorrectly.
Why it is silent. The model learns to predict normalized actions in the wrong scaled space. It is internally consistent — loss converges — just wrong relative to the real action distribution. The error is completely invisible in the loss curve.
5. Timestamp and Frequency Desynchronization
Sim pipelines run at a control frequency often different from rendering frequency. Naive converters copy frame indices rather than computing actual timestamps. Declared FPS implies one temporal relationship; actual data cadence implies another. Temporal operations — observation history, delta timestamps, action chunking windows — all compute against declared FPS.
Why it is silent. For short episodes and small datasets the effect is imperceptible. It compounds with scale, degrading full training runs in ways that look like a policy capacity problem, not a data problem.
6. Episode Boundary Corruption
Episode metadata maps each episode index to its frame offsets. Wrong offsets — off by one, misaligned at chunk boundaries, corrupted during version conversion — mean the model silently learns transitions that never occurred. The effect scales with episode count and contaminates every boundary in the dataset.
A metadata boundary bug in the v2.1-to-v3 conversion path has been reported and reproduced in the community — frames load without error, episode counts look correct, but boundaries are silently shifted.
Why it is silent. The data is intact. Files are readable. Episode counts match. The corruption is entirely in the index layer, which the training loop trusts implicitly.
7. Visual Observation Errors
LeRobot expects images in RGB. Sim pipelines and some camera drivers output BGR. The array shape is identical — (H, W, 3) — and nothing enforces channel ordering. We caught this on our wrist camera during a visual inspection pass. For color-relevant tasks this is a meaningful degradation. For color-agnostic tasks it may be tolerable — but you should know, not guess.
Why it is silent. The array is valid, shape matches, values are in [0, 255]. Nothing in the training pipeline inspects what the image actually shows.
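If a visual pass does turn up swapped channels, the fix belongs in the converter, not the training loop. A one-line reorder is enough; the zero-filled frame below is just a stand-in for whatever your BGR source produces:
import numpy as np
frame_bgr = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for a frame from a BGR camera driver
frame_rgb = frame_bgr[..., ::-1]                       # reverse the channel axis: BGR -> RGB
# Same shape, same dtype, same value range as before, which is exactly why nothing
# downstream complains if this step is missing.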
Severity vs. Detectability
Not all seven failures are equally dangerous. The ones that matter most are not the hardest to fix — they are the hardest to notice. A useful way to think about prioritization:
High severity, low detectability — coordinate frame confusion, normalization stat corruption, model-data contract mismatch. These produce internally consistent training runs that fail at deployment. By the time you notice, you have already spent the GPU budget.
High severity, higher detectability — episode boundary corruption, action representation mismatch. Severe if missed, but detectable with the right checks before training starts.
Lower severity, high detectability — visual observation errors, timestamp desynchronization in small datasets. Catchable quickly once you know to look.
The pre-flight ritual below is ordered accordingly.
The Pre-Flight Validation Ritual
Run this every time you produce or modify a converted dataset — not just on the first conversion. The checks are ordered by the severity-detectability logic above: the failures that are hardest to notice and most expensive to miss come first.
Check 1 — Episode Boundary Integrity
Catches: Episode Boundary Corruption
Sum all episode lengths from your episode metadata and compare to the total row count across all Parquet data files. They must match exactly — no tolerance, no rounding. A discrepancy of even one row means at least one boundary is wrong.
import pandas as pd
import glob
# Load episode metadata (v2.1 example)
episodes = pd.read_json("meta/episodes.jsonl", lines=True)
declared_total = episodes["length"].sum()
# Count actual rows across all data Parquet files
parquet_files = glob.glob("data/**/*.parquet", recursive=True)
actual_total = sum(len(pd.read_parquet(f)) for f in parquet_files)
assert declared_total == actual_total, (
f"Boundary mismatch: declared {declared_total} frames, "
f"found {actual_total} in Parquet files"
)
In v2.1, episode lengths are in episodes.jsonl. In v3, they are in chunked Parquet files under meta/episodes/. The logic is the same either way.
Check 2 — Timestamp Monotonicity and FPS Consistency
Catches: Timestamp and Frequency Desynchronization
Within each episode, verify timestamps are strictly monotonically increasing and that the inter-frame delta is consistent with the FPS declared in info.json. Check episodes near the end of your dataset — floating-point accumulation errors appear late, which is why early dry runs miss them.
import json
import numpy as np
import pandas as pd
with open("meta/info.json") as f:
info = json.load(f)
declared_fps = info["fps"]
expected_delta = 1.0 / declared_fps
tolerance = 1e-3 # adjust based on your control frequency
df = pd.read_parquet("data/chunk-000/episode_000099.parquet") # check a late episode
timestamps = df["timestamp"].values
assert np.all(np.diff(timestamps) > 0), "Timestamps not monotonically increasing"
deltas = np.diff(timestamps)
assert np.allclose(deltas, expected_delta, atol=tolerance), (
f"FPS mismatch: expected delta {expected_delta:.4f}s, "
f"got mean {deltas.mean():.4f}s, max drift {np.abs(deltas - expected_delta).max():.6f}s"
)
Check 3 — Action Distribution Sanity
Catches: Action Representation Mismatch
Load the action column and compute per-dimension statistics: mean, std, min, max, and frame-to-frame delta. Compare against what your simulator actually produced — not what you assumed it produced.
Positions are bounded by joint limits with small frame-to-frame deltas. Velocities are centered near zero with higher variance. If the distribution does not match your intended representation, the conversion likely copied the wrong field. In IsaacLab specifically: actions and processed_actions are not the same — verify which one your converter captured.
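A minimal sketch of that statistics pass, assuming the action column holds fixed-length per-frame vectors as described above. The heuristics in the comments are rough guides; calibrate them against your robot's joint limits and control rate:
import glob
import numpy as np
import pandas as pd
actions = np.vstack([
    np.stack(pd.read_parquet(f)["action"].values)
    for f in sorted(glob.glob("data/**/*.parquet", recursive=True))
])
stats = {
    "mean": actions.mean(axis=0),
    "std": actions.std(axis=0),
    "min": actions.min(axis=0),
    "max": actions.max(axis=0),
    "mean_abs_frame_delta": np.abs(np.diff(actions, axis=0)).mean(axis=0),
}
for name, values in stats.items():
    print(f"{name}: {np.round(values, 4)}")
# Rough guides, not hard rules: absolute joint positions stay inside joint limits with small
# frame-to-frame deltas; velocities center near zero with larger variance; Cartesian deltas
# are small and near-zero-mean. If the numbers do not match the representation you intended,
# the converter probably captured the wrong field.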
Check 4 — Normalization Contract Verification
Catches: Normalization Stat Corruption
Load stats.json and compute the same statistics directly from your full Parquet dataset. For a correctly computed dataset they match to floating-point precision. Any meaningful discrepancy means stats.json must be recomputed from scratch against the complete dataset.
import json
import numpy as np
import pandas as pd
import glob
with open("meta/stats.json") as f:
declared_stats = json.load(f)
all_actions = pd.concat([
pd.read_parquet(f)["action"].apply(pd.Series)
for f in glob.glob("data/**/*.parquet", recursive=True)
])
computed_mean = all_actions.mean().values
declared_mean = np.array(declared_stats["action"]["mean"])
np.testing.assert_allclose(
computed_mean, declared_mean, rtol=1e-4,
err_msg="stats.json mean does not match actual data — recompute from full dataset"
)
Check 5 — Video Integrity and Codec Verification
Catches: Video Decode Failures, Timestamp Desynchronization
Verify every MP4 decodes without error. Confirm the codec is H.264 (avc1) as declared in info.json. Check that frame count in each video matches the row count in the corresponding Parquet episode data. AV1 and other codecs common in sim pipelines will either fail outright or decode incorrectly.
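A minimal sketch of the codec and frame-count pass using ffprobe, assuming it is installed and on your PATH. The mapping from each video file back to its episode's Parquet rows follows the path templates in your info.json, so that comparison is left as a comment:
import glob
import json
import subprocess
for video_path in sorted(glob.glob("videos/**/*.mp4", recursive=True)):
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-count_frames", "-show_entries", "stream=codec_name,nb_read_frames",
         "-of", "json", video_path],
        capture_output=True, text=True, check=True,
    )
    stream = json.loads(probe.stdout)["streams"][0]
    codec = stream["codec_name"]
    n_frames = int(stream["nb_read_frames"])
    assert codec == "h264", f"{video_path}: expected h264, found {codec}"
    print(f"{video_path}: {codec}, {n_frames} frames")
    # Compare n_frames against the row count of the episode this video belongs to;
    # a mismatch means frame-level timestamps cannot all resolve to valid seek positions.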
Check 6 — Visual Inspection
Catches: Channel Ordering, Scene Coherence
Load raw frames from several episodes across each camera and look at them. Verify color rendering — an object you know is red should render red. Step through a few episodes and confirm the visual sequence matches the demonstrated behavior.
This is the one check no automated tool fully replaces. Statistical checks cannot tell you that your wrist camera has inverted channels or that the visual sequence does not match the motion. Two minutes per camera is enough.
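A small helper for that pass: a sketch assuming torchvision can decode your videos (Check 5 should already have confirmed that), with a hypothetical wrist-camera path you would substitute with your own.
from torchvision.io import read_video
from torchvision.utils import save_image
video_path = "videos/chunk-000/observation.images.wrist/episode_000000.mp4"  # hypothetical path
frames, _, _ = read_video(video_path, pts_unit="sec", output_format="TCHW")
for i in (0, len(frames) // 2, len(frames) - 1):       # first, middle, last frame
    save_image(frames[i].float() / 255.0, f"inspect_frame_{i:05d}.png")
# Open the PNGs and look: is the object you know is red actually red, and does stepping
# from first to last frame match the motion you demonstrated?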
Check 7 — Model-Data Contract Verification
Catches: Model-Data Contract Mismatch
Read the target model’s data loading code — not just the documentation, the actual code. Verify expected action dimensionality, whether the model handles action chunking internally or expects pre-chunked sequences, how gripper state should be encoded, and what observation keys the model expects.
This is the only model-specific check in the ritual. Architectural assumptions not enforced at the dataset level — Pi0’s internal action chunking being the one we learned the hard way — will not surface in any of the checks above.
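There is no generic script for this check, but once you have read the model's loading code you can at least pin down the schema side of the contract. The sketch below assumes the features block your converter wrote into info.json; the expected values are placeholders you fill in from that reading:
import json
with open("meta/info.json") as f:
    features = json.load(f)["features"]
# Placeholders: take these from the target model's data loading code, not its docs.
expected_action_dim = 7                                                     # e.g. 6 joints + gripper
expected_cameras = {"observation.images.top", "observation.images.wrist"}   # hypothetical camera keys
assert features["action"]["shape"] == [expected_action_dim], (
    f"action shape {features['action']['shape']} != expected [{expected_action_dim}]"
)
video_keys = {k for k, v in features.items() if v.get("dtype") == "video"}
assert expected_cameras <= video_keys, f"missing camera keys: {expected_cameras - video_keys}"
# Internal vs. pre-computed chunking, gripper encoding, and observation history length
# cannot be asserted from the schema alone; those you verify by reading the model code.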
At a Glance
| # | Check | Silent Killer Caught | Severity | Detectability | Effort |
|---|---|---|---|---|---|
| 1 | Episode boundary integrity | Boundary corruption | High | Medium | < 1 min |
| 2 | Timestamp monotonicity | Temporal desync | Medium | Low | < 5 min |
| 3 | Action distribution sanity | Representation mismatch | High | Medium | < 5 min |
| 4 | Normalization contract | Stat corruption | High | Low | < 5 min |
| 5 | Video integrity and codec | Decode failures | Medium | High | < 10 min |
| 6 | Visual inspection | Channel ordering | Medium | High | 10–15 min |
| 7 | Model-data contract | Architecture mismatch | High | Low | 15–30 min |
Total: under an hour. The cost of skipping it is at minimum one wasted training run — at worst a deployment failure that is very hard to trace back to a data problem.
What We Are Still Investigating
The seven failure modes above are the ones we have hit or can reproduce. They are not the ceiling of what can go wrong. Three open problems we are actively working through at Fireloop:
Gripper state encoding across architectures. Pi0, GR00T, and SmolVLA encode gripper open/close differently — binary, continuous, normalized. The mismatch between what your sim recorded and what the model expects is a silent killer in the same class as action representation mismatch. We have not hit a clean failure here yet but we expect to, and we are building the check for it now.
Multi-camera temporal alignment. Verifying that frames across different camera views are actually synchronized — not just that each stream is internally consistent — is harder than it looks. A wrist camera and a workspace camera running at nominally the same FPS can accumulate meaningful drift over long episodes. We do not yet have a reliable check for this.
Action smoothness as a data quality signal. Abrupt discontinuities within an episode often signal a conversion artifact rather than real demonstrated behavior. We are developing intuition for what the right smoothness thresholds are across different robot embodiments and task types — it is not a single number.
These are the problems we are pushing on next. If you are further along on any of them, we want to know.
Your Turn
If you have hit a failure mode in sim-to-LeRobot conversion that is not covered here — a silent killer we missed, a check that saved a training run, or a model-specific contract issue — share it.
Specifically: has anyone hit gripper encoding issues when converting sim data for Pi0 or GR00T fine-tuning? We are actively investigating this and a concrete reproduction case would accelerate our work significantly.
The seven checks in this ritual are not the final answer. They are what we have validated so far. Every practitioner who documents a failure mode clearly makes the next person’s iteration faster. Share your experience in the comments, open a reproducible issue against the relevant repository, or reach out directly.
The ecosystem gets better when the people debugging it write down what they find.