RC · Recording Data Integrity
LuckyEngine's product value depends on the integrity of the data it records. Datasets that look fine but have missing frames, off-by-one timestamps, silently-dropped writes, or non-deterministic noise are worse than no data — they corrupt downstream training without being noticed. Every change to a recording-critical path must be evaluated against this doc.
Findings in recording-critical files are auto-promoted to must-fix by /cr. Treat every patch that touches a recorder, writer, observer, or the Acquisition / Export phase as a higher-rigour review. Silent data corruption is the worst-case outcome — you'd rather fail loudly than emit a quietly broken dataset.
What is "recording-critical"
A code path is recording-critical if it directly produces, schedules, validates, or finalises observation / dataset output. Concretely:
- Hazel/src/Hazel/Data/Recorders/*: RobotRecorder, MujocoFullRecorder, DroneRecorder, CameraRecorder, DataRecorder.
- Hazel/src/Hazel/Data/Writers/*: ParquetDataWriter, FFmpegWriter, MaskImageWriter, WriterInterfaces.
- Hazel/src/Hazel/Data/Observer.{h,cpp}: the central observation layer.
- Hazel/src/Hazel/Data/EpisodeReportStreamer.{h,cpp}: episode report → LuckyHub streaming.
- TimeManager Acquisition and Export phase callbacks (where data is captured and emitted).
- GrpcStepSystem / GrpcCapturePool: when they snapshot state for over-the-wire delivery.
- The Headless / training entry path (HeadlessApplication, EpisodeRunner-style code) when used in a recording configuration.
Rule: changes touching any of the above are reviewed at higher rigour than ordinary engine code. /cr auto-promotes findings in these paths to must-fix.
Recording lifecycle
The episode state machine governs when on-disk state may be trusted. Crash-safe layout means an interrupted episode is observable as "in progress" or as "not present", never as "complete" with a truncated payload.
The four integrity guarantees
Every recording must satisfy all four.
1. Completeness
Every step that ran produced a row. Every camera frame captured got written. No silent skips.
- No try { ... } catch (...) { /* warn and continue */ } in a recording path. If a write fails, the episode is corrupt: fail the episode, don't cover it up.
- No "best-effort" file ops where the failure mode is data loss. WARN-on-failure is acceptable for diagnostics; WARN-and-keep-recording is not.
- If a frame is dropped (e.g., camera capture overrun), it must be detected and either backfilled with a sentinel or recorded as a gap with explicit metadata. Never silently emit row N+1 after row N-1.
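The gap rule above can be sketched as follows. This is a minimal illustration, not the engine's recorder: Row, StepLog, and Append are hypothetical names, and the sketch assumes step indices start at 0 and increase by one per step. Skipped steps become explicit gap rows, and an out-of-order write throws instead of being logged and ignored.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical row type: one row per simulation step.
struct Row {
    std::uint64_t step;  // simulation step index
    bool is_gap;         // true if this row is a sentinel for a dropped step
};

// Hypothetical in-memory sink standing in for a real writer.
class StepLog {
public:
    // Appends the row for `step`. Any skipped steps are recorded as explicit
    // gap rows, so row N+1 never silently follows row N-1.
    void Append(std::uint64_t step) {
        if (step < m_nextStep) {
            // Fail the episode loudly rather than warn and keep recording.
            throw std::runtime_error("step " + std::to_string(step) +
                                     " written out of order");
        }
        for (; m_nextStep < step; ++m_nextStep) {
            m_rows.push_back({m_nextStep, true});  // sentinel for the gap
        }
        m_rows.push_back({step, false});
        m_nextStep = step + 1;
    }

    const std::vector<Row>& Rows() const { return m_rows; }

private:
    std::uint64_t m_nextStep = 0;  // next step index we expect to see
    std::vector<Row> m_rows;
};
```

Appending steps 0, 1, 3 yields four rows, with step 2 present and flagged as a gap. A downstream reader can then filter or interpolate deliberately instead of training on a silently shortened sequence.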
2. Determinism
Re-running the same simulation with the same seed produces byte-identical recorded data (within the engine's defined determinism contract — see Mode::HighPerformanceDeterministic in TimeManager).
- No unordered containers (std::unordered_map, std::unordered_set) for iteration-order-sensitive data in recording paths. Use std::map or insertion-ordered containers.
- No platform-dependent math results in fields that go to disk. Watch for sinf / cosf discrepancies between MSVC and Clang on Linux / macOS; pin to a deterministic library or fixed-precision representation.
- No timing-dependent fields (steady_clock::now() deltas) recorded as values that downstream consumers compare across runs. Record absolute simulation times (or simulation tick counts) only.
- RNG state is seeded explicitly and lives in known places. Never call thread-local RNGs from a recording path.
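Two of the rules above, ordered iteration and explicit seeding, are easy to show in isolation. The sketch below is illustrative only: SerialiseReadings and NoiseSource are hypothetical names, not engine APIs. It uses the raw output of std::mt19937, whose sequence for a given seed is fixed by the C++ standard; the standard distributions (std::uniform_real_distribution and friends) are implementation-defined and can differ between standard libraries, so they are left out on purpose.

```cpp
#include <cstdint>
#include <map>
#include <random>
#include <string>

// Serialises readings in key order. std::map iterates in sorted key order on
// every platform, so the output bytes do not depend on hash seeds or builds.
inline std::string SerialiseReadings(
    const std::map<std::string, std::int64_t>& readings) {
    std::string out;
    for (const auto& entry : readings) {
        out += entry.first;
        out += '=';
        out += std::to_string(entry.second);
        out += '\n';
    }
    return out;
}

// Hypothetical noise source: an explicitly seeded engine owned by the caller,
// never a thread-local or globally shared generator.
class NoiseSource {
public:
    explicit NoiseSource(std::uint32_t seed) : m_engine(seed) {}

    // Raw engine output; the mt19937 sequence is specified by the standard.
    std::uint32_t Next() { return static_cast<std::uint32_t>(m_engine()); }

private:
    std::mt19937 m_engine;
};
```

Two NoiseSource instances built with the same seed produce the same sequence, which is the property a byte-identical re-run depends on.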
3. Atomicity
If the editor crashes mid-episode, the on-disk state is one of:
- "Episode N was complete and finalised" — fully written, parquet / video closed, manifest updated.
- "Episode N was in progress, never finalised" — partial files exist but the manifest does not claim the episode is complete; downstream tooling skips it.
There's no third state where the manifest says "complete" but the parquet truncates mid-row, or the video has a half-written frame.
- The manifest update is the last write of an episode. Everything else (parquet rows, video frames, JSON sidecars) must be flushed and synced before the manifest is touched.
- Crash-safe layout: write to episode-N.tmp/, atomically rename to episode-N/ only on success. Or use a marker file (.complete) that's the last thing written.
- Don't update a "current state" pointer or registry until the new state is fully on disk.
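The temp-directory layout can be sketched with std::filesystem. This is a minimal illustration under stated assumptions: WriteEpisodeAtomically and the data.bin file name are hypothetical, and the sketch only flushes the stream to the OS. A real writer would also need a platform sync call (fsync on POSIX, FlushFileBuffers on Windows) before the rename, which is omitted here.

```cpp
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Writes the payload into <root>/episode-N.tmp/ and publishes it by renaming
// the directory to <root>/episode-N/ only after the write succeeded.
inline fs::path WriteEpisodeAtomically(const fs::path& root, int episode,
                                       const std::string& payload) {
    const fs::path finalDir = root / ("episode-" + std::to_string(episode));
    fs::path tmpDir = finalDir;
    tmpDir += ".tmp";  // operator+= appends without a path separator

    if (fs::exists(finalDir)) {
        throw std::runtime_error("episode already finalised: " +
                                 finalDir.string());
    }
    fs::remove_all(tmpDir);  // clear leftovers from an earlier crashed attempt
    fs::create_directories(tmpDir);

    {
        std::ofstream out(tmpDir / "data.bin", std::ios::binary);
        out << payload;
        out.flush();
        if (!out) {
            throw std::runtime_error("payload write failed");
        }
    }  // stream is closed here, before the directory is published

    // Publishing step: readers only ever see episode-N/ once it is complete.
    fs::rename(tmpDir, finalDir);  // throws fs::filesystem_error on failure
    return finalDir;
}
```

If the process dies before the rename, only episode-N.tmp/ exists and downstream tooling can skip it. The rename is the single point at which the episode becomes visible as complete.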
4. Schema fidelity
Every recorded field has a documented type, shape, and unit. Schema changes are versioned.
- ObservationSchema mismatches (declared vs. actually-written) must fail loudly. Don't silently coerce.
- Adding a new column requires a schema version bump and either backfill or explicit handling for old datasets.
- Don't change the meaning of an existing column (e.g., switch units from radians to degrees) without a new column name. Old datasets become invalid otherwise.
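A declared-vs-written check might look like the sketch below. The types here (FieldSpec, Schema, ValidateRow) are hypothetical stand-ins and are not the engine's ObservationSchema, whose real shape this doc does not describe. The point is only that a mismatch throws with enough context to find the offending field, rather than padding, truncating, or casting.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical schema description; the real ObservationSchema may differ.
struct FieldSpec {
    std::string name;
    std::string unit;   // e.g. "rad", "m/s"
    std::size_t width;  // number of scalar elements in the field
};

struct Schema {
    int version;
    std::vector<FieldSpec> fields;
};

// Throws if the row about to be written does not match the declared schema.
// Nothing is coerced: a wrong field count or width fails the write.
inline void ValidateRow(const Schema& schema,
                        const std::vector<std::vector<double>>& row) {
    if (row.size() != schema.fields.size()) {
        throw std::runtime_error(
            "schema v" + std::to_string(schema.version) + ": expected " +
            std::to_string(schema.fields.size()) + " fields, got " +
            std::to_string(row.size()));
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
        const FieldSpec& spec = schema.fields[i];
        if (row[i].size() != spec.width) {
            throw std::runtime_error(
                "schema v" + std::to_string(schema.version) + ": field '" +
                spec.name + "' [" + spec.unit + "] expected width " +
                std::to_string(spec.width) + ", got " +
                std::to_string(row[i].size()));
        }
    }
}
```

Carrying the unit in the spec keeps it next to the type and shape, so a units change shows up as a schema diff rather than as a silent reinterpretation of an existing column.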
Performance constraints in recording paths
Even though most recording happens at 30 Hz (well under per-frame budgets), the constraints are tighter than they look because episodes can run for hours and accumulate millions of writes.
- No per-row allocations in tight write loops. Pre-size buffers, reuse them. A std::string constructed per row across a 10M-row episode is 10M allocations.
- No file-handle churn. Open writer once per episode, write through it, close at episode end. Don't open / close per row.
- No format-string overhead per write. If you're calling std::format per row, batch into a column writer instead.
- Camera / video paths: see FFmpegWriter for the existing pattern. Don't decode / encode synchronously on the simulation thread; the writer should buffer and let an encoder thread drain (this is already in place: extend it, don't bypass it).
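The "pre-size and reuse" and "batch into a column writer" rules combine naturally. The sketch below is a hypothetical ColumnBatch, not the ParquetDataWriter API: columns are reserved once, rows are appended as plain values with no per-row formatting, and the buffers are cleared and reused after each flush.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical column batch: rows accumulate in pre-sized column vectors and
// are handed to the writer in bulk, with no per-row allocation or formatting.
class ColumnBatch {
public:
    explicit ColumnBatch(std::size_t rowsPerFlush)
        : m_rowsPerFlush(rowsPerFlush) {
        m_ticks.reserve(rowsPerFlush);
        m_values.reserve(rowsPerFlush);
    }

    // Appends one row. Returns true when the batch is full and the caller
    // should pass the columns to the writer and then call Reset().
    bool Append(std::uint64_t tick, double value) {
        m_ticks.push_back(tick);
        m_values.push_back(value);
        return m_ticks.size() >= m_rowsPerFlush;
    }

    // std::vector::clear() leaves capacity unchanged, so the next batch
    // reuses the same storage instead of allocating again.
    void Reset() {
        m_ticks.clear();
        m_values.clear();
    }

    const std::vector<std::uint64_t>& Ticks() const { return m_ticks; }
    const std::vector<double>& Values() const { return m_values; }

private:
    std::size_t m_rowsPerFlush;
    std::vector<std::uint64_t> m_ticks;
    std::vector<double> m_values;
};
```

As long as the caller flushes whenever Append returns true, the vectors never grow past their reserved capacity, so a multi-hour episode performs the same two allocations as a ten-second one.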
Concurrency rules in recording paths
Most recording happens on TimeManager runners (Acquisition for capture, Export for write-out). The rules from Threading apply, plus:
- No race between writer and reader of the recording buffer. If the simulation thread fills a buffer that an encoder thread drains, the handoff goes through a synchronised queue or a double-buffer with explicit barriers.
- No mid-step finalisation. An episode's final write must happen at a tick boundary, not mid-tick. Otherwise some streams may have row N and others row N-1 for the "last" step.
- gRPC capture is a snapshot, not a live read. Don't stream out of buffers that are still being mutated by the simulation thread.
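The snapshot rule can be sketched as a small double buffer. SnapshotBuffer is a hypothetical class, not GrpcCapturePool: the simulation thread mutates a private back buffer, publishes a copy at a tick boundary under a short lock, and consumers only ever copy the published front buffer. The lock covers memory copies only and is never held across file or network I/O.

```cpp
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical double buffer for handing state from the simulation thread to
// a consumer (encoder, gRPC capture) without exposing live, mutating data.
class SnapshotBuffer {
public:
    // Simulation thread only: the back buffer is never read by consumers.
    std::vector<float>& Back() { return m_back; }

    // Simulation thread, at a tick boundary: publish the current state.
    void Publish(std::uint64_t tick) {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_front = m_back;  // copy; reuses m_front's capacity when sizes match
        m_frontTick = tick;
    }

    // Consumer thread: returns the tick and a copy of the last published
    // state. Later mutation of the back buffer cannot affect the result.
    std::pair<std::uint64_t, std::vector<float>> Snapshot() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return {m_frontTick, m_front};
    }

private:
    mutable std::mutex m_mutex;
    std::vector<float> m_back;
    std::vector<float> m_front;
    std::uint64_t m_frontTick = 0;
};
```

Because Publish is only called at a tick boundary, every snapshot corresponds to one whole step. The copy in Snapshot allocates; in a hot path you would pass in a caller-owned buffer instead, which this sketch leaves out for brevity.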
What /cr flags as must-fix
Findings auto-promoted to must-fix when the file is recording-critical:
| Pattern | Why it's must-fix |
|---|---|
| catch (...) { warn; continue; } in a writer | Violates Completeness: silently drops data |
| std::unordered_map iterated to disk | Violates Determinism: order varies by platform / build |
| std::format / std::string allocation in a per-row loop | Performance constraint: accumulates over long episodes |
| Manifest update before stream flush | Violates Atomicity: crash mid-flush leaves "complete" lying about partial data |
| New column added without schema version bump | Violates Schema fidelity: old readers misinterpret data |
| Mutex held across file write | Stalls the simulation thread; in RealtimeNonDeterministic mode this drops ticks |
| steady_clock::now() recorded as cross-run comparison field | Violates Determinism |
| File-handle open / close inside a hot write loop | Performance constraint + can leave dangling temp files on crash |
| Reading from a buffer concurrently being written without a lock / atomic | Race: produces corrupted rows |
For consider-tier findings in recording paths, document the suspicion explicitly even if you don't fix — the user reviews these manually.
When you're modifying a recording path
Before opening the file, ask:
- Which guarantee am I touching? Completeness, Determinism, Atomicity, or Schema fidelity? (Often more than one.)
- What's the failure mode if I'm wrong? Silent corruption is the worst case — flag any change whose worst case is "downstream training silently degrades."
- Is there an existing pattern? The recorder / writer family already solves the common cases. Extend rather than re-implementing.
- What's the test plan? Can I run an episode and diff its output against a known-good baseline? If not, a code-only review is insufficient — flag this in the PR description.
If a fix here requires a schema change, version bump, or migration path, that's a separate task. Don't bundle it into an unrelated change.
Related
- Data / Observation: the engine-side schema, observer, and recorder family.
- Threading: TimeManager phases, worker rules, gRPC marshalling.
- .claude/docs/RecordingIntegrity.md: canonical source.