The Annotation Bottleneck — Failure Modes, Diagnostic Signals, and a Validation Framework 

The Problem Nobody Talks About

Synthetic data generation pipelines for robot imitation learning have matured fast. Tools like IsaacMimic and SkillGen can take a handful of human demonstrations and generate thousands of training episodes at scale. When the pipeline works, the results are compelling. When it doesn’t, the failure is rarely where people expect. 

The physics simulation is mature. The policy architectures are well-validated. The demonstration collection, assuming reasonable teleoperation quality, is not usually the problem. The problem is frequently the step sitting between demonstration collection and data generation — subtask annotation — and the field has not treated it with the seriousness it deserves.

Subtask annotation is the process of dividing teleoperated demonstrations into labelled segments that the data generator uses to transform and recombine trajectories across new scene configurations. It is manual and judgment-intensive, with no standardised methodology, limited tooling, and almost no community guidance beyond brief documentation notes. It is also the step whose quality most directly determines whether your generated dataset is genuinely useful for policy training — or subtly, silently broken.

This blog is about that step: how it goes wrong, and how to validate it systematically before committing to expensive generation runs. 

What You Are Actually Deciding 

In IsaacMimic and SkillGen, a subtask is defined by three things: a start boundary, an end boundary, and a reference object — the object whose pose the generator uses to transform that segment into a new scene configuration. 

The generator does not replay original trajectories. It takes each subtask segment, computes the transformation between the reference object pose in the source demonstration and its pose in the new scene, applies that transformation to the end-effector trajectory, and interpolates between transformed segments. Annotation decisions directly control four things: 

  • What gets transformed — segment boundaries determine which portion of the trajectory is handled per subtask 
  • The spatial reference — reference object assignment defines the coordinate frame for transformation 
  • How segments stitch together — boundary states determine interpolation quality; unstable boundary states produce unstable interpolations 
  • How arms synchronise — in dual-arm tasks, coordination constraints define temporal relationships between arm subtasks 

Each is a direct consequence of annotation. Understanding this makes it easier to reason about what to look for when things go wrong. 
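
To make the transformation concrete, here is a minimal numpy sketch of the per-segment logic described above. It is illustrative only; the function name and data layout are our own, not the IsaacMimic API. The core operation is to express the end-effector segment in the reference object's frame and then re-anchor it to the object's pose in the new scene.

```python
import numpy as np

def transform_segment(ee_poses_src, obj_pose_src, obj_pose_new):
    """Re-express an end-effector segment relative to a new reference-object pose.

    ee_poses_src : (T, 4, 4) end-effector poses from the source demonstration
    obj_pose_src : (4, 4) reference object pose in the source demonstration
    obj_pose_new : (4, 4) reference object pose in the new scene

    All poses are homogeneous transforms in the world frame.
    """
    # Pose of the end effector expressed in the reference object's frame.
    ee_in_obj = np.linalg.inv(obj_pose_src) @ ee_poses_src
    # Re-anchor that relative motion to the object's pose in the new scene.
    return obj_pose_new @ ee_in_obj
```

Segment boundaries decide which slice of the trajectory receives which reference object, and the interpolation that bridges consecutive transformed segments starts and ends at the boundary states. That is why the stability of those states matters so much in the failure modes below.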

The Four Ways Annotation Goes Wrong 

Annotation failures are not random. They fall into four distinct failure modes, each with a different root cause and a different diagnostic signature. 

Failure Mode 1 — Wrong Granularity 

Too few subtasks limit augmentation diversity — the generator has less to work with and the dataset clusters around the source demonstrations rather than genuinely exploring the configuration space. Too many subtasks increase the number of stitching operations, each of which is an interpolation opportunity for collision, kinematic infeasibility, or unnatural motion. 

The instinct to over-segment — annotating every perceptible phase change — is worth resisting. Use as few subtasks as the task genuinely requires. A task with two clearly distinct contact phases needs two subtasks. The simpler segmentation should be the starting point, not the fallback. 

Failure Mode 2 — Wrong Placement 

Granularity determines how many boundaries you draw. Placement determines where each sits. Placement errors are more subtle and more damaging. 

The principle is straightforward: boundaries should be placed where the robot is in free space, in a stable configuration, with the reference object in a well-defined pose. The most common error is annotating too early — placing a boundary immediately after a contact event before the robot has stabilised from it. 

Consider a grasp: the natural instinct is to mark the boundary the moment the gripper closes. But at that moment the object pose may not be stable and the arm is likely still close to the table surface. The generator then has to reproduce that state in a new scene configuration and interpolate into the next subtask from it. Collisions and failed attempts follow predictably. The better placement is after grasp and lift — object clear of the surface, grasp stable, arm with room to interpolate freely. The difference in boundary position may be only a few timesteps. The difference in generation quality is not small. 

A secondary error is annotating during dynamic transitions — mid-pour, mid-insertion, mid-rotation — where object pose is changing rapidly and reference frame assumptions are least reliable. Place boundaries just before or just after dynamic phases, not within them. 
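
A quick numerical check can catch many of these placement errors before a pilot run. The sketch below is one hedged example: the signal names, thresholds, and the assumption that the boundary follows a lift are illustrative choices, not tool defaults.

```python
import numpy as np

def boundary_is_stable(t, ee_speed, obj_height, gripper_cmd,
                       speed_thresh=0.02, clearance=0.05, window=10):
    """Heuristic check that timestep t is a reasonable subtask boundary.

    ee_speed    : (T,) end-effector speed per timestep [m/s]
    obj_height  : (T,) reference object height above the support surface [m]
    gripper_cmd : (T,) gripper command (e.g. 0 = open, 1 = closed)
    """
    lo, hi = max(0, t - window), min(len(ee_speed), t + window)
    slow_enough    = ee_speed[lo:hi].max() < speed_thresh              # robot roughly at rest
    object_clear   = obj_height[t] > clearance                         # object lifted clear of the surface
    gripper_steady = np.all(gripper_cmd[lo:hi] == gripper_cmd[t])      # no mid-actuation boundary
    return bool(slow_enough and object_clear and gripper_steady)
```

If a candidate boundary fails this kind of check, the usual fix is to move it a few timesteps later rather than to loosen the thresholds.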

Failure Mode 3 — Wrong Reference Object Assignment 

The reference object for a subtask should be the object the robot is primarily interacting with during that phase. For a pick subtask, it is the object being picked. For a place subtask, it is the target location or container. 

The diagnostic signature here is distinct: generation success rates may not drop dramatically, but trajectories show systematic positional bias. The robot approaches from the right direction relative to the wrong object. In visual inspection this looks like geometrically coherent but contextually wrong motion — correct relative to something, just not the right thing. 

In multi-object tasks, verify reference object assignments for each subtask as a separate review step, independent of boundary placement. 
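
One cheap way to do that review is to compute, for each annotated segment, which object the end effector actually stays closest to and compare it against the assigned reference object. The sketch below assumes you have per-timestep object and end-effector positions logged; the data layout is hypothetical.

```python
import numpy as np

def closest_object_per_segment(ee_positions, object_positions, boundaries):
    """For each annotated segment, report the object the end effector stays nearest to.

    ee_positions     : (T, 3) end-effector positions
    object_positions : dict of name -> (T, 3) object positions
    boundaries       : list of (start, end) index pairs, one per subtask
    """
    report = []
    for start, end in boundaries:
        mean_dist = {
            name: np.linalg.norm(ee_positions[start:end] - pos[start:end], axis=1).mean()
            for name, pos in object_positions.items()
        }
        report.append(min(mean_dist, key=mean_dist.get))
    return report  # compare against your reference-object assignments
```

A mismatch is not proof of an error, since approach phases can legitimately spend time nearer another object, but it is a cheap flag worth a second look.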

Failure Mode 4 — Wrong Coordination Assumptions (Dual-Arm) 

Single-arm annotation is a linear problem. Dual-arm annotation is a graph problem — each arm has its own subtask sequence and those sequences interact. The most common error is under-specifying temporal relationships between arm subtasks. 

IsaacLab Mimic’s SubTaskConstraintConfig allows defining whether subtasks must end simultaneously, one must precede the other, or they are independent. Leaving these unspecified produces trajectories where arms execute correct individual subtasks out of sync — a bimanual handoff where one arm arrives before the other is ready, a coordinated lift where one arm begins before the other has completed its grasp. These failures are difficult to catch from success rate alone because individual subtask execution may be fine. The failure is in coordination, which only manifests when both arms run together. 
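
To illustrate the kind of relationship these constraints encode, here is a small, deliberately generic representation. The class and field names below are hypothetical stand-ins, not the actual SubTaskConstraintConfig schema; consult the IsaacLab Mimic documentation for the real fields.

```python
from dataclasses import dataclass
from enum import Enum, auto

class ConstraintType(Enum):
    SEQUENTIAL = auto()    # one arm's subtask must finish before the other's starts
    SIMULTANEOUS = auto()  # both subtasks must end together (e.g. a handoff)

@dataclass
class ArmSubtaskConstraint:
    """Illustrative stand-in for a dual-arm coordination constraint."""
    first_arm: str        # e.g. "left"
    first_subtask: int    # index into that arm's subtask sequence
    second_arm: str       # e.g. "right"
    second_subtask: int
    constraint: ConstraintType

# A bimanual handoff: the right arm's "receive" subtask must end together with
# the left arm's "present object" subtask.
handoff = ArmSubtaskConstraint("left", 1, "right", 0, ConstraintType.SIMULTANEOUS)
```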

We have not extensively validated dual-arm annotation in our own work, and we want to be direct about that limitation of scope. The failure modes compound significantly at this level of complexity and the community has even less guidance here than for single-arm tasks. 

Compound Failures 

These four modes are analytically distinct but practically they compound. Working through them systematically — granularity first, then placement, then reference object, then coordination — provides a more reliable diagnostic path than intuition alone. This is also the logic behind the validation framework in the next section: each layer is designed to surface a different subset of these failure modes. 

Validating Annotations Before You Commit 

One important assumption underlies this section: the source demonstrations are good. Annotation quality cannot rescue a poorly executed teleoperation dataset. How to collect quality demonstrations is its own topic. Here we assume that problem is already solved. 

Most practitioners treat annotation as a one-shot step — mark the boundaries, move to generation. In our experience this is where quietly broken pipelines are born. The approach we have found useful is to treat annotation validation as its own explicit pipeline stage with three layers of signal, applied before committing to full-scale generation. 

Layer 1 — Generation Success Rate as a Diagnostic Signal 

IsaacMimic and SkillGen report how many generation attempts were required to produce a given number of successful demonstrations. Most practitioners treat this as a compute cost. We treat it as a quality signal. 

Before full generation, run a pilot targeting 10-20% of your intended dataset size and observe the success rate. If it is significantly lower than expected, the annotation is the first place to look — not the environment configuration or the robot controller. 

More usefully: compare success rates across two or three boundary placement variants for the same task. The variant producing the highest success rate with the fewest attempts is almost always the better annotation — a direct reflection of how reliably the generator can interpolate between your subtask segments. 
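
A back-of-the-envelope comparison is enough at this stage. The variant names and numbers below are hypothetical, and how you extract attempt and success counts depends on your own logging around the generation run.

```python
def attempts_per_success(num_attempts, num_successes):
    """Crude quality signal from a pilot generation run."""
    return num_attempts / max(num_successes, 1)

# Hypothetical pilot results for three boundary-placement variants of the same task.
pilots = {
    "boundary_at_grasp":       {"attempts": 412, "successes": 100},
    "boundary_after_lift":     {"attempts": 138, "successes": 100},
    "boundary_mid_transition": {"attempts": 655, "successes": 100},
}

ranked = sorted(pilots, key=lambda k: attempts_per_success(pilots[k]["attempts"],
                                                           pilots[k]["successes"]))
for name in ranked:
    run = pilots[name]
    rate = attempts_per_success(run["attempts"], run["successes"])
    print(f"{name}: {rate:.1f} attempts per success")
```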

Layer 2 — Visual Trajectory Inspection 

Generation success rate tells you something is wrong. Visual inspection tells you what. 

Inspect a sample of both successful and failed trajectories from your pilot run. Look specifically for: 

At subtask transitions: Smoothness is the primary signal. A well-placed boundary produces natural motion continuation. A poorly placed one produces visible discontinuities — direction changes, speed spikes, or unnatural pauses at the interpolation seam. 

Gripper behaviour: Unnecessary open/close cycles near transition points reliably indicate a boundary placed during or too close to a contact phase. The gripper should be in a stable state at every boundary crossing. 

Lift and drop heights: Inconsistency across generated trajectories suggests the reference object frame is not being correctly captured at the boundary — typically pointing to placement before the object has reached a stable pose. 

Joint jerk: Abrupt velocity changes at transitions indicate the interpolation is bridging an unreasonably large gap. Investigate the boundary before adjusting interpolation step counts. 
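
Most of these signals can also be computed cheaply rather than eyeballed, which helps once you are inspecting more than a handful of trajectories. The sketch below assumes you can recover the stitch points (boundary timesteps) from the generator's metadata; the window size and timestep are illustrative.

```python
import numpy as np

def transition_diagnostics(ee_positions, gripper_cmd, boundaries, dt=1/30, window=10):
    """Cheap numeric proxies for the visual signals above.

    ee_positions : (T, 3) end-effector positions of one generated trajectory
    gripper_cmd  : (T,) binary gripper command
    boundaries   : timestep indices where subtask segments were stitched
    """
    vel  = np.gradient(ee_positions, dt, axis=0)
    acc  = np.gradient(vel, dt, axis=0)
    jerk = np.linalg.norm(np.gradient(acc, dt, axis=0), axis=1)

    report = []
    for b in boundaries:
        lo, hi = max(0, b - window), min(len(jerk), b + window)
        report.append({
            "boundary": int(b),
            "peak_jerk": float(jerk[lo:hi].max()),                          # spikes flag large interpolation gaps
            "gripper_toggles": int(np.abs(np.diff(gripper_cmd[lo:hi])).sum()),  # open/close cycles near the seam
        })
    return report
```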

The goal is not perfect trajectories — some variance is expected and desirable. The goal is identifying systematic failure patterns that trace back to a specific boundary decision. 

Layer 3 — Lightweight Policy Smoke Test 

This is the most underused validation step and in our experience the most revealing one. 

Train a basic behavioural cloning policy on your small pilot dataset. Keep the architecture simple, the training short. You are not producing a deployable policy — you are answering one question: does this data have learnable structure? 

A dataset from good annotations will show coherent transitional behaviour even in a lightly trained policy. It may not be accurate, but it will demonstrate learned task phase transitions — attempting the right action at roughly the right time. 

A dataset from poor annotations produces a policy confused at transition points — hesitating, exhibiting mode-switching behaviour, failing to chain subtasks together even when individual subtask execution looks reasonable. This is the silent failure mode made visible. The generated data looked fine, the success rate was acceptable, but the underlying structure was subtly wrong. 

A small dataset with a lightweight BC model trains in minutes to an hour. The signal it gives you is worth the time before committing to full-scale generation and a complete training run. 
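
A sketch of what "basic" means here, assuming the generated dataset has already been flattened into observation and action tensors. The architecture, batch size, and epoch count are deliberately unremarkable; the point is a fast signal, not a good policy.

```python
import torch
import torch.nn as nn

def smoke_test_bc(obs, actions, obs_dim, act_dim, epochs=20, lr=1e-3):
    """Minimal behavioural-cloning smoke test on (observation, action) tensors."""
    policy = nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, act_dim),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(obs, actions), batch_size=256, shuffle=True
    )
    for _ in range(epochs):
        for o, a in loader:
            loss = nn.functional.mse_loss(policy(o), a)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy  # roll this out in simulation and watch the subtask transitions
```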

Putting It Together 

These three layers work as a sequence. Generation rate flags a problem. Visual inspection localises it. The policy smoke test confirms whether the fix was sufficient. 

No single layer is conclusive alone. A high generation success rate with poor visual smoothness is still a bad annotation. A visually smooth dataset that produces a confused policy at transitions is still a data quality problem. Used together they give reasonable confidence before the expensive steps. 

These three layers reflect what we have found useful for the tasks and complexities we have worked with — they are a starting point, not a complete prescription. Task complexity, robot morphology, and contact richness will surface additional signals. If you are working on more complex setups and have found signals that belong in this list, we would genuinely like to hear from you. 

The following is a short clip from our own work — a trained policy executing the task after going through the annotation and validation process described above. The smoothness of execution and clean task phase transitions are a direct reflection of annotation quality. A poorly annotated pipeline producing the same volume of data would not yield this result. 

The Automation Landscape — Honest Assessment 

The natural question is whether this validation burden can be automated. The honest answer: partially, and not yet in production-ready form for IsaacLab pipelines. 

Heuristic-based annotation is the most mature option currently available. For simple structured tasks with reliable state signals — gripper contact, object height thresholds, joint position targets — algorithmic boundary detection works reasonably well. It becomes brittle as task complexity increases. For anything beyond basic pick-and-place it requires significant task-specific engineering and still needs manual validation. 
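
As an example of what such a heuristic looks like in practice, the sketch below detects a pick boundary from two state signals. The signal names and threshold are illustrative, and as noted above, the output still needs manual validation.

```python
import numpy as np

def detect_pick_boundary(gripper_closed, obj_height, clearance=0.05):
    """Heuristic boundary for a simple pick phase: the first timestep at which
    the gripper is closed AND the object has risen clear of the surface.

    gripper_closed : (T,) boolean, True when the gripper is commanded closed
    obj_height     : (T,) object height above the support surface [m]
    Returns the boundary index, or None if the condition never holds.
    """
    candidates = np.flatnonzero(gripper_closed & (obj_height > clearance))
    return int(candidates[0]) if candidates.size else None
```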

VLM and LLM-assisted segmentation is an active and promising research direction. Approaches like SeeDo and ECoT demonstrate that large vision-language models can interpret demonstration videos and annotate trajectories with subtask labels at scale. These are not currently integrated into IsaacLab workflows in production-ready form — bridging their outputs to the SubTaskConfig format IsaacMimic and SkillGen expect remains a non-trivial engineering gap the community has not yet standardised. 

Annotation-free imitation learning sidesteps the problem entirely, using keyframe selection and visual alignment rather than explicit subtask boundaries. Recent ICRA 2025 work demonstrates one-shot imitation for multi-step tasks without manual annotation. Intellectually promising — not yet a practical alternative for teams working within current IsaacMimic and SkillGen pipelines. 

The practical reality: subtask annotation remains manual and judgment-intensive for most teams today. The automation trajectory is clear. The gap between research demonstration and production-ready tooling is still significant. 

Annotation Is a First-Class Engineering Decision 

Subtask annotation is not a preprocessing detail. It is a design decision with direct consequences for data quality, generation efficiency, and policy performance. The field’s tendency to underinvest in it is understandable — papers focus on architectures, documentation focuses on tool usage, and annotation sits between them, unglamorous and under-specified. 

The cost is paid silently. Failed generation runs. Policies that don’t converge cleanly. Debugging sessions that trace back to a boundary placed three timesteps too early. 

What we have described here — four failure modes, three validation layers — is a structured starting point, not a universal prescription. The practical takeaway is simple: before full-scale generation, validate your annotations. Run the pilot. Inspect the trajectories. Train the smoke test policy. The iteration cost at annotation time is a fraction of the cost of discovering a data quality problem after a full training run. 

The methodology improves when more people contribute to it. If you are building IL or VLA pipelines and have found signals or approaches that belong in this framework — we would genuinely like to hear from you. 

References and Further Reading

The following papers and documentation informed this blog and are recommended for readers who want to go deeper into the tools and research directions discussed.