RC · Recording Data Integrity

LuckyEngine's product value depends on the integrity of the data it records. Datasets that look fine but have missing frames, off-by-one timestamps, silently-dropped writes, or non-deterministic noise are worse than no data — they corrupt downstream training without being noticed. Every change to a recording-critical path must be evaluated against this doc.

High-rigour review
Doc source: .claude/docs/RecordingIntegrity.md
Recorders: Hazel/src/Hazel/Data/Recorders/
Writers: Hazel/src/Hazel/Data/Writers/

Read this first

Findings in recording-critical files are auto-promoted to must-fix by /cr. Treat every patch that touches a recorder, writer, observer, or the Acquisition / Export phase as a higher-rigour review. Silent data corruption is the worst-case outcome — you'd rather fail loudly than emit a quietly broken dataset.

What is "recording-critical"

A code path is recording-critical if it directly produces, schedules, validates, or finalises observation / dataset output. Concretely:

  • Hazel/src/Hazel/Data/Recorders/*: RobotRecorder, MujocoFullRecorder, DroneRecorder, CameraRecorder, DataRecorder.
  • Hazel/src/Hazel/Data/Writers/*: ParquetDataWriter, FFmpegWriter, MaskImageWriter, WriterInterfaces.
  • Hazel/src/Hazel/Data/Observer.{h,cpp} — the central observation layer.
  • Hazel/src/Hazel/Data/EpisodeReportStreamer.{h,cpp} — episode report → LuckyHub streaming.
  • TimeManager Acquisition and Export phase callbacks (where data is captured and emitted).
  • GrpcStepSystem / GrpcCapturePool — when they snapshot state for over-the-wire delivery.
  • The Headless / training entry path (HeadlessApplication, EpisodeRunner-style code) when used in a recording configuration.

Rule: changes touching any of the above are reviewed at higher rigour than ordinary engine code. /cr auto-promotes findings in these paths to must-fix.

Recording lifecycle

The episode state machine governs when on-disk state may be trusted. Crash-safe layout means an interrupted episode is observable as "in progress" or as "not present", never as "complete" with a truncated payload.

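States: Idle (no episode), PendingApproval (awaiting operator OK), Recording (writing tmp dir), Finalising (flush, rename, manifest), Complete (.complete marker), Rejected, Aborted.
Transitions: Idle → PendingApproval (request); PendingApproval → Recording (approve); PendingApproval → Rejected (deny); Recording → Finalising (finish episode); Finalising → Complete (manifest written); Recording → Aborted (crash / abort).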
Recording state machine. Blue = forward path, green = finalisation, red = failure / rejection. Crash from Recording leaves partial tmp output but no manifest claim — downstream tooling skips it.
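
The state machine can be sketched as a pure transition function. This is a minimal illustration, not LuckyEngine's actual API: EpisodeState, EpisodeEvent, NextState, and IsTrustedOnDisk are hypothetical names, and the edges are one reading of the diagram. Only the crash edge out of Recording is modelled, since that is the only one the figure describes.

```cpp
#include <optional>

// Hypothetical names: the states mirror the diagram, but these enums and
// functions are illustrative only.
enum class EpisodeState { Idle, PendingApproval, Recording, Finalising, Complete, Rejected, Aborted };
enum class EpisodeEvent { Request, Approve, Deny, FinishEpisode, ManifestWritten, CrashOrAbort };

// Returns the next state, or std::nullopt if the transition is not allowed.
inline std::optional<EpisodeState> NextState(EpisodeState s, EpisodeEvent e) {
    switch (s) {
    case EpisodeState::Idle:
        if (e == EpisodeEvent::Request) return EpisodeState::PendingApproval;
        break;
    case EpisodeState::PendingApproval:
        if (e == EpisodeEvent::Approve) return EpisodeState::Recording;
        if (e == EpisodeEvent::Deny) return EpisodeState::Rejected;
        break;
    case EpisodeState::Recording:
        if (e == EpisodeEvent::FinishEpisode) return EpisodeState::Finalising;
        if (e == EpisodeEvent::CrashOrAbort) return EpisodeState::Aborted;
        break;
    case EpisodeState::Finalising:
        if (e == EpisodeEvent::ManifestWritten) return EpisodeState::Complete;
        break;
    default:
        break;  // Complete, Rejected, and Aborted are terminal in this sketch
    }
    return std::nullopt;
}

// Only a Complete episode may be trusted by downstream tooling.
inline bool IsTrustedOnDisk(EpisodeState s) { return s == EpisodeState::Complete; }
```

Rejecting unknown transitions (rather than ignoring them) keeps an out-of-order event from quietly moving an episode toward Complete.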

The four integrity guarantees

Every recording must satisfy all four.

1. Completeness

Every step that ran produced a row. Every camera frame captured got written. No silent skips.

  • No try { ... } catch (...) { /* warn and continue */ } in a recording path. If a write fails, the episode is corrupt — fail the episode, don't cover it up.
  • No "best-effort" file ops where the failure mode is data loss. WARN-on-failure is acceptable for diagnostics; WARN-and-keep-recording is not.
  • If a frame is dropped (e.g., camera capture overrun), it must be detected and either backfilled with a sentinel or recorded as a gap with explicit metadata. Never silently emit row N+1 after row N-1.
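
A minimal sketch of the gap rule, assuming rows are keyed by a monotonically increasing simulation tick. Row and StepLog are hypothetical types invented for this example; a real recorder would carry observation columns and write through the existing writer family. The sketch uses the "backfill with a sentinel" option and fails loudly on anything it cannot explain.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical row type: only the bookkeeping fields are shown.
struct Row {
    std::uint64_t tick;  // simulation tick this row belongs to
    bool isGap;          // true if the row is an explicit "dropped" sentinel
};

class StepLog {
public:
    // Appends the row for `tick`. Skipped ticks become explicit gap rows, so a
    // reader never sees row N+1 directly after row N-1.
    void Append(std::uint64_t tick) {
        if (m_last && tick <= *m_last)
            throw std::runtime_error("non-monotonic tick: episode is corrupt");
        std::uint64_t next = m_last ? *m_last + 1 : tick;
        for (; next < tick; ++next)
            m_rows.push_back({next, true});  // sentinel for a missing tick
        assert(next == tick);
        m_rows.push_back({tick, false});
        m_last = tick;
    }

    const std::vector<Row>& Rows() const { return m_rows; }

private:
    std::vector<Row> m_rows;
    std::optional<std::uint64_t> m_last;  // empty until the first row arrives
};
```

Appending ticks 1, 2, 4 yields four rows, with tick 3 marked as a gap. A repeated or backwards tick throws instead of being dropped.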

2. Determinism

Re-running the same simulation with the same seed produces byte-identical recorded data (within the engine's defined determinism contract — see Mode::HighPerformanceDeterministic in TimeManager).

  • No unordered containers (std::unordered_map, std::unordered_set) for iteration-order-sensitive data in recording paths. Use std::map or insertion-ordered containers.
  • No platform-dependent math results in fields that go to disk. Watch for sinf/cosf discrepancies between MSVC and Clang on Linux / macOS — pin to a deterministic library or fixed-precision representation.
  • No timing-dependent fields (steady_clock::now() deltas) recorded as values that downstream consumers compare across runs. Record simulation tick counts, or absolute simulation times derived from them, only.
  • RNG state is seeded explicitly and lives in known places. Never call thread-local RNGs from a recording path.
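
Two of the rules above can be sketched in a few lines. SortedForDisk and RecordingRng are hypothetical helpers, not part of the engine. The first fixes iteration order before anything reaches a writer. The second keeps RNG state in one explicitly seeded object. The raw output of std::mt19937_64 is specified by the C++ standard, but the standard distribution classes (for example std::uniform_real_distribution) are not guaranteed to produce identical values across standard libraries, so this sketch exposes raw engine output only.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Copies an unordered map into a key-sorted vector before anything is written,
// so the on-disk order does not depend on hashing or bucket layout.
inline std::vector<std::pair<std::string, double>>
SortedForDisk(const std::unordered_map<std::string, double>& values) {
    std::vector<std::pair<std::string, double>> out(values.begin(), values.end());
    std::sort(out.begin(), out.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    return out;
}

// Hypothetical noise source: owns an explicitly seeded engine instead of
// reaching for a thread-local or globally seeded generator.
struct RecordingRng {
    explicit RecordingRng(std::uint64_t seed) : engine(seed) {}
    std::uint64_t NextRaw() { return engine(); }
    std::mt19937_64 engine;
};
```

Two RecordingRng instances built from the same seed return the same sequence, which is what lets a re-run reproduce recorded noise.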

3. Atomicity

If the editor crashes mid-episode, the on-disk state is one of:

  • "Episode N was complete and finalised" — fully written, parquet / video closed, manifest updated.
  • "Episode N was in progress, never finalised" — partial files exist but the manifest does not claim the episode is complete; downstream tooling skips it.

There's no third state where the manifest says "complete" but the parquet truncates mid-row, or the video has a half-written frame.

  • The manifest update is the last write of an episode. Everything else (parquet rows, video frames, JSON sidecars) must be flushed and synced before the manifest is touched.
  • Crash-safe layout: write to episode-N.tmp/, atomically rename to episode-N/ only on success. Or use a marker file (.complete) that's the last thing written.
  • Don't update a "current state" pointer or registry until the new state is fully on disk.
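
The tmp-directory layout can be sketched with std::filesystem. WriteEpisodeAtomically and the data.bin file name are hypothetical, and the sketch writes a single payload rather than parquet and video streams. One caveat is labelled in the code: the C++ standard library has no portable fsync, so a real writer would need a platform call to get the data onto disk before the rename.

```cpp
#include <filesystem>
#include <fstream>
#include <ios>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Writes the payload into <root>/episode-N.tmp/, then renames the directory to
// <root>/episode-N/ as the very last step. A crash before the rename leaves only
// the .tmp directory, which downstream tooling can skip.
inline fs::path WriteEpisodeAtomically(const fs::path& root, int episode,
                                       const std::string& payload) {
    const std::string name = "episode-" + std::to_string(episode);
    const fs::path tmpDir = root / (name + ".tmp");
    const fs::path finalDir = root / name;
    if (fs::exists(finalDir) || fs::exists(tmpDir))
        throw std::runtime_error("episode directory already exists: refusing to overwrite");

    fs::create_directories(tmpDir);
    {
        std::ofstream out(tmpDir / "data.bin", std::ios::binary);
        out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
        out.flush();
        if (!out) throw std::runtime_error("payload write failed: episode is corrupt");
        // ASSUMPTION: flush() only empties the C++ stream buffer. A real writer
        // also needs an OS-level sync (fsync / FlushFileBuffers) at this point.
    }
    fs::rename(tmpDir, finalDir);  // last operation: publishes the episode
    return finalDir;
}
```

Refusing to overwrite an existing .tmp directory is a choice made for this sketch. It keeps leftovers from a crashed run available for inspection instead of silently deleting them.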

4. Schema fidelity

Every recorded field has a documented type, shape, and unit. Schema changes are versioned.

  • ObservationSchema mismatches (declared vs. actually-written) must fail loudly. Don't silently coerce.
  • Adding a new column requires a schema version bump and either backfill or explicit handling for old datasets.
  • Don't change the meaning of an existing column (e.g., switch units from radians to degrees) without a new column name. Old datasets become invalid otherwise.
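
A minimal sketch of a fail-loud shape check, assuming each field is a flat array of doubles. FieldSpec, DeclaredSchema, and ValidateRow are simplified stand-ins invented for this example. They do not reflect the real ObservationSchema interface, which this doc does not show.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a declared observation schema.
struct FieldSpec {
    std::string name;
    std::string unit;    // e.g. "rad", "m/s"
    std::size_t length;  // number of scalars in the field
};

struct DeclaredSchema {
    int version;  // bumped whenever a column is added
    std::vector<FieldSpec> fields;
};

// Throws if the row about to be written does not match the declared shape.
// Nothing is padded, truncated, or coerced.
inline void ValidateRow(const DeclaredSchema& schema,
                        const std::vector<std::vector<double>>& row) {
    if (row.size() != schema.fields.size()) {
        throw std::runtime_error(
            "schema v" + std::to_string(schema.version) + ": expected " +
            std::to_string(schema.fields.size()) + " fields, got " +
            std::to_string(row.size()));
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
        const FieldSpec& spec = schema.fields[i];
        if (row[i].size() != spec.length) {
            throw std::runtime_error(
                "field '" + spec.name + "' [" + spec.unit + "]: expected " +
                std::to_string(spec.length) + " values, got " +
                std::to_string(row[i].size()));
        }
    }
}
```

The error message includes the schema version and unit, so a mismatch can be traced to a specific declaration rather than found later in training.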

Performance constraints in recording paths

Even though most recording happens at 30 Hz (well under per-frame budgets), the constraints are tighter than they look because episodes can run for hours and accumulate millions of writes.

  • No per-row allocations in tight write loops. Pre-size buffers, reuse them. A std::string constructed per row across a 10M-row episode is 10M allocations.
  • No file-handle churn. Open writer once per episode, write through it, close at episode end. Don't open / close per row.
  • No format-string overhead per write. If you're calling std::format per row, batch into a column writer instead.
  • Camera / video paths: see FFmpegWriter for the existing pattern. Don't decode / encode synchronously on the simulation thread; the writer should buffer and let an encoder thread drain (this is already in place — extend it, don't bypass it).
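
The first two constraints (no per-row allocation, no handle churn) can be sketched as below. RowWriter is a hypothetical class writing raw floats with the C stdio API. It is not how ParquetDataWriter or FFmpegWriter work; it only shows one handle and one scratch buffer living for the whole episode.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

// Hypothetical row writer: one file handle and one scratch buffer per episode.
class RowWriter {
public:
    RowWriter(const char* path, std::size_t floatsPerRow)
        : m_file(std::fopen(path, "wb")), m_scratch(floatsPerRow) {
        if (!m_file) throw std::runtime_error("cannot open episode file");
    }
    ~RowWriter() { if (m_file) std::fclose(m_file); }
    RowWriter(const RowWriter&) = delete;
    RowWriter& operator=(const RowWriter&) = delete;

    // The caller fills this buffer in place; it is allocated once, up front.
    std::vector<float>& Scratch() { return m_scratch; }

    // Writes the current scratch contents as one row. A short write fails the episode.
    void Commit() {
        const std::size_t n =
            std::fwrite(m_scratch.data(), sizeof(float), m_scratch.size(), m_file);
        if (n != m_scratch.size())
            throw std::runtime_error("short write: episode is corrupt");
        ++m_rows;
    }

    // Call at episode end so a failed close is reported instead of being
    // swallowed by the destructor.
    void Close() {
        if (!m_file) return;
        std::FILE* f = m_file;
        m_file = nullptr;
        if (std::fclose(f) != 0)
            throw std::runtime_error("close failed: episode is corrupt");
    }

    std::size_t Rows() const { return m_rows; }

private:
    std::FILE* m_file;
    std::vector<float> m_scratch;
    std::size_t m_rows = 0;
};
```

The hot path is Scratch() followed by Commit(): no allocation, no open or close, no string formatting.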

Concurrency rules in recording paths

Most recording happens on TimeManager runners (Acquisition for capture, Export for write-out). The rules from Threading apply, plus:

  • No race between writer and reader of the recording buffer. If the simulation thread fills a buffer that an encoder thread drains, the handoff goes through a synchronised queue or a double-buffer with explicit barriers.
  • No mid-step finalisation. An episode's final write must happen at a tick boundary, not mid-tick. Otherwise some streams may have row N and others row N-1 for the "last" step.
  • gRPC capture is a snapshot, not a live read. Don't stream out of buffers that are still being mutated by the simulation thread.
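
A minimal sketch of the synchronised-queue handoff from the first rule, assuming one producer (the simulation thread) and one consumer (an encoder or writer thread). HandoffQueue is a hypothetical name and is not the queue FFmpegWriter uses. The sketch is unbounded; a production version would likely need a capacity limit and a defined policy for what happens when the consumer falls behind.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical handoff queue. Items are moved in by the producer and moved out
// by the consumer, so the two threads never touch the same buffer at once.
template <typename T>
class HandoffQueue {
public:
    void Push(T item) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_items.push_back(std::move(item));
        }
        m_ready.notify_one();
    }

    // Marks end of episode. Pop() returns std::nullopt once the queue is drained.
    void Close() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_closed = true;
        }
        m_ready.notify_all();
    }

    // Blocks until an item is available or the queue is closed and empty.
    std::optional<T> Pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_ready.wait(lock, [this] { return !m_items.empty() || m_closed; });
        if (m_items.empty()) return std::nullopt;
        T item = std::move(m_items.front());
        m_items.pop_front();
        return item;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::deque<T> m_items;
    bool m_closed = false;
};
```

Pop() returns the item by value and releases the lock before the caller does any file I/O, so the mutex is never held across a write.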

What /cr flags as must-fix

Findings auto-promoted to must-fix when the file is recording-critical:

| Pattern | Why it's must-fix |
| --- | --- |
| catch (...) { warn; continue; } in a writer | Violates Completeness: silently drops data |
| std::unordered_map iterated to disk | Violates Determinism: order varies by platform / build |
| std::format / std::string allocation in a per-row loop | Performance constraint: accumulates over long episodes |
| Manifest update before stream flush | Violates Atomicity: crash mid-flush leaves "complete" lying about partial data |
| New column added without schema version bump | Violates Schema fidelity: old readers misinterpret data |
| Mutex held across file write | Stalls the simulation thread; in RealtimeNonDeterministic mode this drops ticks |
| steady_clock::now() recorded as cross-run comparison field | Violates Determinism |
| File-handle open / close inside a hot write loop | Performance constraint; can also leave dangling temp files on crash |
| Reading from a buffer concurrently being written without a lock / atomic | Race: produces corrupted rows |

For consider-tier findings in recording paths, document the suspicion explicitly even if you don't fix — the user reviews these manually.

When you're modifying a recording path

Before opening the file, ask:

  1. Which guarantee am I touching? Completeness, Determinism, Atomicity, or Schema fidelity? (Often more than one.)
  2. What's the failure mode if I'm wrong? Silent corruption is the worst case — flag any change whose worst case is "downstream training silently degrades."
  3. Is there an existing pattern? The recorder / writer family already solves the common cases. Extend rather than re-implementing.
  4. What's the test plan? Can I run an episode and diff its output against a known-good baseline? If not, a code-only review is insufficient — flag this in the PR description.
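
For the test-plan question, the simplest baseline check is a byte-for-byte comparison, which matches the byte-identical wording of the Determinism guarantee. FilesByteIdentical is a hypothetical helper; it is a sketch for small artefacts and reads through stream iterators rather than optimising for multi-gigabyte files.

```cpp
#include <algorithm>
#include <fstream>
#include <ios>
#include <iterator>
#include <string>

// Returns true only if both files open and their contents are byte-identical.
// A missing or unreadable file counts as a mismatch, never as a pass.
inline bool FilesByteIdentical(const std::string& a, const std::string& b) {
    std::ifstream fa(a, std::ios::binary);
    std::ifstream fb(b, std::ios::binary);
    if (!fa || !fb) return false;
    std::istreambuf_iterator<char> ia(fa), ib(fb), end;
    // The four-iterator overload also fails when one file is a prefix of the other.
    return std::equal(ia, end, ib, end);
}
```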

Don't bundle

If a fix here requires a schema change, version bump, or migration path, that's a separate task. Don't bundle it into an unrelated change.

Related docs

  • Data / Observation — the engine-side schema, observer, and recorder family.
  • Threading — TimeManager phases, worker rules, gRPC marshalling.
  • .claude/docs/RecordingIntegrity.md — canonical source.