RC · Recording Data Integrity

LuckyEngine's product value depends on the integrity of the data it records. Datasets that look fine but have missing frames, off-by-one timestamps, silently-dropped writes, or non-deterministic noise are worse than no data — they corrupt downstream training without being noticed. Every change to a recording-critical path must be evaluated against this doc.

High-rigour review
Doc source: .claude/docs/RecordingIntegrity.md
Recorders: Hazel/src/Hazel/Data/Recorders/
Writers: Hazel/src/Hazel/Data/Writers/

Read this first

Findings in recording-critical files are auto-promoted to must-fix by /cr. Treat every patch that touches a recorder, writer, observer, or the Acquisition / Export phase as a higher-rigour review. Silent data corruption is the worst-case outcome — you'd rather fail loudly than emit a quietly broken dataset.

What is "recording-critical"

A code path is recording-critical if it directly produces, schedules, validates, or finalises observation / dataset output. Concretely:

Hazel/src/Hazel/Data/Recorders/* — RobotRecorder, MujocoFullRecorder, DroneRecorder, CameraRecorder, DataRecorder.
Hazel/src/Hazel/Data/Writers/* — ParquetDataWriter, FFmpegWriter, MaskImageWriter, WriterInterfaces.
Hazel/src/Hazel/Data/Observer.{h,cpp} — the central observation layer.
Hazel/src/Hazel/Data/EpisodeReportStreamer.{h,cpp} — episode report → LuckyHub streaming.
TimeManager Acquisition and Export phase callbacks (where data is captured and emitted).
GrpcStepSystem / GrpcCapturePool — when they snapshot state for over-the-wire delivery.
The Headless / training entry path (HeadlessApplication, EpisodeRunner-style code) when used in a recording configuration.

Rule: changes touching any of the above are reviewed at higher rigour than ordinary engine code. /cr auto-promotes findings in these paths to must-fix.

Recording lifecycle

The episode state machine governs when on-disk state may be trusted. With a crash-safe layout, an interrupted episode is observable as "in progress" or as "not present", never as "complete" with a truncated payload.

Recording state machine. Blue = forward path, green = finalisation, red = failure / rejection. A crash during Recording leaves partial tmp output but no manifest claim, so downstream tooling skips it.

The four integrity guarantees

Every recording must satisfy all four.

1. Completeness

Every step that ran produced a row. Every camera frame captured got written. No silent skips.

No try { ... } catch (...) { /* warn and continue */ } in a recording path. If a write fails, the episode is corrupt — fail the episode, don't cover it up.
No "best-effort" file ops where the failure mode is data loss. WARN-on-failure is acceptable for diagnostics; WARN-and-keep-recording is not.
If a frame is dropped (e.g., camera capture overrun), it must be detected and either backfilled with a sentinel or recorded as a gap with explicit metadata. Never silently emit row N+1 after row N-1.
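
The rules above can be sketched as follows. This is a minimal illustration, not the engine's recorder API: `Row`, `RequireWrite`, and `EpisodeLog` are hypothetical names, and the choice of a sentinel row (rather than separate gap metadata) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical row type: one entry per simulation step.
struct Row {
    std::uint64_t step;  // simulation step index
    bool is_gap;         // true when this row is a sentinel for a dropped step
};

// Turns a failed low-level write into an episode failure instead of a warning.
inline void RequireWrite(bool ok, const std::string& what) {
    if (!ok) {
        throw std::runtime_error("recording write failed: " + what);
    }
}

// Hypothetical in-memory log that never lets row N+1 directly follow row N-1.
class EpisodeLog {
public:
    void Append(std::uint64_t step) {
        if (step < m_next) {
            // Duplicate or out-of-order step: the episode is corrupt, so fail it.
            throw std::runtime_error("non-monotonic step " + std::to_string(step));
        }
        // Backfill every skipped step with an explicit gap sentinel.
        for (; m_next < step; ++m_next) {
            m_rows.push_back(Row{m_next, true});
        }
        m_rows.push_back(Row{step, false});
        m_next = step + 1;
        assert(m_rows.size() == m_next);  // invariant: exactly one row per step
    }

    std::size_t GapCount() const {
        std::size_t gaps = 0;
        for (const Row& row : m_rows) {
            gaps += row.is_gap ? 1 : 0;
        }
        return gaps;
    }

    const std::vector<Row>& Rows() const { return m_rows; }

private:
    std::vector<Row> m_rows;
    std::uint64_t m_next = 0;  // next step index expected by Append
};
```

Appending steps 0, 1, 3 to this log yields four rows, with row 2 marked as a gap, so a downstream reader can see the drop instead of inferring it from a missing index.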

2. Determinism

Re-running the same simulation with the same seed produces byte-identical recorded data (within the engine's defined determinism contract — see Mode::HighPerformanceDeterministic in TimeManager).

No unordered containers (std::unordered_map, std::unordered_set) for iteration-order-sensitive data in recording paths. Use std::map or insertion-ordered containers.
No platform-dependent math results in fields that go to disk. Watch for sinf/cosf discrepancies between MSVC on Windows and Clang on Linux / macOS; pin to a deterministic library or fixed-precision representation.
No timing-dependent fields (steady_clock::now() deltas) recorded as values that downstream consumers compare across runs. Record absolute simulation times (or simulation tick counts) only.
RNG state is seeded explicitly and lives in known places. Never call thread-local RNGs from a recording path.
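
A minimal sketch of three of these rules, under stated assumptions: `StableColumnOrder`, `SeededNoise`, and `TickToNanoseconds` are hypothetical helpers, not engine code. The sketch returns raw `std::mt19937_64` output because the C++ standard pins the engine's sequence for a given seed, while the output of the standard distribution classes is implementation-defined.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <vector>

// std::map iterates in key order, so the column order written to disk does not
// depend on hashing, bucket counts, or the standard library implementation.
inline std::vector<std::string> StableColumnOrder(
    const std::map<std::string, double>& channels) {
    std::vector<std::string> names;
    names.reserve(channels.size());
    for (const auto& entry : channels) {
        names.push_back(entry.first);
    }
    return names;
}

// Noise source that is seeded explicitly and owned by whoever records with it,
// rather than pulled from a thread-local generator.
class SeededNoise {
public:
    explicit SeededNoise(std::uint64_t seed) : m_engine(seed) {}

    // Raw engine output only; distributions are deliberately avoided here.
    std::uint64_t NextRaw() { return m_engine(); }

private:
    std::mt19937_64 m_engine;
};

// Derives a timestamp from the tick count with integer arithmetic, so the
// recorded value never depends on a wall clock or on floating-point rounding.
inline std::uint64_t TickToNanoseconds(std::uint64_t tick,
                                       std::uint64_t ticks_per_second) {
    assert(ticks_per_second > 0);
    const std::uint64_t whole_seconds = tick / ticks_per_second;
    const std::uint64_t remainder = tick % ticks_per_second;
    return whole_seconds * 1'000'000'000ULL +
           remainder * 1'000'000'000ULL / ticks_per_second;
}
```

Two `SeededNoise` instances built from the same seed produce the same sequence, which is the property a re-run with the same seed relies on.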

3. Atomicity

If the editor crashes mid-episode, the on-disk state is one of:

"Episode N was complete and finalised" — fully written, parquet / video closed, manifest updated.
"Episode N was in progress, never finalised" — partial files exist but the manifest does not claim the episode is complete; downstream tooling skips it.

There's no third state where the manifest says "complete" but the parquet file is truncated mid-row, or the video has a half-written frame.

The manifest update is the last write of an episode. Everything else (parquet rows, video frames, JSON sidecars) must be flushed and synced before the manifest is touched.
Crash-safe layout: write to episode-N.tmp/, atomically rename to episode-N/ only on success. Or use a marker file (.complete) that's the last thing written.
Don't update a "current state" pointer or registry until the new state is fully on disk.
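
The tmp-then-rename layout could look roughly like this. `FinaliseEpisode` and the `data.bin` file name are hypothetical, and the sketch assumes both directories live on the same filesystem so the rename is a single operation. `std::ofstream::flush` only hands bytes to the OS; forcing them to disk (fsync on POSIX, FlushFileBuffers on Windows) is platform-specific and left out here.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Writes the payload under episode-N.tmp/ and publishes it as episode-N/ only
// after the write succeeded. Returns the final directory.
inline fs::path FinaliseEpisode(const fs::path& root, int episode,
                                const std::string& payload) {
    assert(episode >= 0);
    const std::string name = "episode-" + std::to_string(episode);
    const fs::path tmp_dir = root / (name + ".tmp");
    const fs::path final_dir = root / name;

    // Refuse to overwrite a finalised episode or reuse a crashed attempt.
    if (fs::exists(final_dir) || fs::exists(tmp_dir)) {
        throw std::runtime_error("episode directory already exists: " + name);
    }

    fs::create_directories(tmp_dir);
    {
        std::ofstream out(tmp_dir / "data.bin", std::ios::binary);
        out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
        out.flush();
        if (!out) {
            throw std::runtime_error("payload write failed for " + name);
        }
        // A real writer would also sync the file to disk at this point.
    }  // stream closed before the directory becomes visible under its final name

    // Last step: publish. Throws fs::filesystem_error if the rename fails.
    fs::rename(tmp_dir, final_dir);
    return final_dir;
}
```

If the process dies before the rename, only episode-N.tmp/ exists and a reader that looks for episode-N/ never sees the partial payload.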

4. Schema fidelity

Every recorded field has a documented type, shape, and unit. Schema changes are versioned.

ObservationSchema mismatches (declared vs. actually-written) must fail loudly. Don't silently coerce.
Adding a new column requires a schema version bump and either backfill or explicit handling for old datasets.
Don't change the meaning of an existing column (e.g., switch units from radians to degrees) without a new column name. Old datasets become invalid otherwise.
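
A fail-loud schema check might be sketched like this. `ColumnSpec`, `Schema`, `ValidateRow`, and `RequireSupportedVersion` are hypothetical; the engine's ObservationSchema may be shaped quite differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical column declaration: every field carries a name, unit, and shape.
struct ColumnSpec {
    std::string name;
    std::string unit;   // e.g. "rad", "m/s"
    std::size_t width;  // number of scalar elements per row
};

struct Schema {
    std::uint32_t version;
    std::vector<ColumnSpec> columns;
};

// Throws when the values about to be written do not match the declaration.
// Nothing is padded, truncated, or coerced.
inline void ValidateRow(const Schema& schema,
                        const std::vector<std::vector<double>>& row) {
    if (row.size() != schema.columns.size()) {
        throw std::runtime_error(
            "column count mismatch: declared " +
            std::to_string(schema.columns.size()) + ", got " +
            std::to_string(row.size()));
    }
    for (std::size_t i = 0; i < row.size(); ++i) {
        if (row[i].size() != schema.columns[i].width) {
            throw std::runtime_error("width mismatch in column '" +
                                     schema.columns[i].name + "'");
        }
    }
}

// Reader-side guard: refuse datasets written with a newer schema version
// instead of guessing what the extra or changed columns mean.
inline void RequireSupportedVersion(const Schema& schema,
                                    std::uint32_t max_supported) {
    if (schema.version > max_supported) {
        throw std::runtime_error("unsupported schema version " +
                                 std::to_string(schema.version));
    }
}
```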

Performance constraints in recording paths

Even though most recording happens at 30 Hz (well under per-frame budgets), the constraints are tighter than they look because episodes can run for hours and accumulate millions of writes.

No per-row allocations in tight write loops. Pre-size buffers, reuse them. A std::string constructed per row across a 10M-row episode can mean 10M heap allocations (small-string optimisation only helps while every value stays short).
No file-handle churn. Open writer once per episode, write through it, close at episode end. Don't open / close per row.
No format-string overhead per write. If you're calling std::format per row, batch into a column writer instead.
Camera / video paths: see FFmpegWriter for the existing pattern. Don't decode / encode synchronously on the simulation thread; the writer should buffer and let an encoder thread drain (this is already in place — extend it, don't bypass it).
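
As an illustration of the pre-size-and-reuse rule, here is a sketch of a row buffer that allocates once per episode. `RowBuffer` is a hypothetical class; copying raw host bytes is an assumption made to keep the example short, and a real path would hand values to the existing column writers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical row staging buffer: sized once, reused for every row.
class RowBuffer {
public:
    explicit RowBuffer(std::size_t row_bytes) { m_bytes.reserve(row_bytes); }

    // Starts a new row. clear() drops the contents but keeps the capacity,
    // so no allocation happens between rows.
    void Begin() { m_bytes.clear(); }

    // Appends one trivially copyable value to the current row.
    template <typename T>
    void Put(const T& value) {
        static_assert(std::is_trivially_copyable_v<T>,
                      "Put only accepts trivially copyable values");
        const std::size_t offset = m_bytes.size();
        assert(offset + sizeof(T) <= m_bytes.capacity() &&
               "row exceeds the pre-sized buffer");
        m_bytes.resize(offset + sizeof(T));  // stays within reserved capacity
        std::memcpy(m_bytes.data() + offset, &value, sizeof(T));
    }

    const std::vector<char>& Bytes() const { return m_bytes; }

private:
    std::vector<char> m_bytes;
};
```

Because every row fits the reserved capacity, the underlying storage pointer stays the same for the whole episode, which is an easy property to check in a test.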

Concurrency rules in recording paths

Most recording happens on TimeManager runners (Acquisition for capture, Export for write-out). The rules from Threading apply, plus:

No race between writer and reader of the recording buffer. If the simulation thread fills a buffer that an encoder thread drains, the handoff goes through a synchronised queue or a double-buffer with explicit barriers.
No mid-step finalisation. An episode's final write must happen at a tick boundary, not mid-tick. Otherwise some streams may have row N and others row N-1 for the "last" step.
gRPC capture is a snapshot, not a live read. Don't stream out of buffers that are still being mutated by the simulation thread.
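
One way to express the synchronised handoff is a small mutex-and-condition-variable queue. `Frame` and `FrameQueue` are hypothetical and do not describe how FFmpegWriter is actually built. The queue is unbounded for brevity; a bounded ring would suit long episodes better.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical captured frame: an owned snapshot, not a view into live state.
struct Frame {
    std::uint64_t tick;
    std::vector<std::uint8_t> pixels;
};

class FrameQueue {
public:
    // Simulation thread: moves the frame in, so the producer keeps no
    // reference that the encoder thread could race with.
    void Push(Frame frame) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            assert(!m_closed && "push after the episode was closed");
            m_frames.push_back(std::move(frame));
        }
        m_ready.notify_one();
    }

    // Called once, at a tick boundary, when the episode ends.
    void Close() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_closed = true;
        }
        m_ready.notify_all();
    }

    // Encoder thread: blocks until a frame is available. Returns std::nullopt
    // once the queue is closed and fully drained.
    std::optional<Frame> Pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_ready.wait(lock, [this] { return m_closed || !m_frames.empty(); });
        if (m_frames.empty()) {
            return std::nullopt;
        }
        Frame frame = std::move(m_frames.front());
        m_frames.pop_front();
        return frame;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::deque<Frame> m_frames;
    bool m_closed = false;
};
```

The mutex is held only while a frame is moved in or out. Encoding and file writes happen after Pop returns, outside the lock, so the simulation thread is never stalled behind disk I/O.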

What `/cr` flags as must-fix

Findings auto-promoted to must-fix when the file is recording-critical:

Pattern	Why it's must-fix
`catch (...) { warn; continue; }` in a writer	Violates Completeness — silently drops data
`std::unordered_map` iterated to disk	Violates Determinism — order varies by platform / build
`std::format` / `std::string` allocation in a per-row loop	Performance constraint — accumulates over long episodes
Manifest update before stream flush	Violates Atomicity — crash mid-flush leaves "complete" lying about partial data
New column added without schema version bump	Violates Schema fidelity — old readers misinterpret data
Mutex held across file write	Stalls the simulation thread; in `RealtimeNonDeterministic` mode this drops ticks
`steady_clock::now()` recorded as cross-run comparison field	Violates Determinism
File-handle open / close inside a hot write loop	Performance constraint + can leave dangling temp files on crash
Reading from a buffer concurrently being written without a lock / atomic	Race — produces corrupted rows

For consider-tier findings in recording paths, document the suspicion explicitly even if you don't fix it; the user reviews these manually.

When you're modifying a recording path

Before opening the file, ask:

Which guarantee am I touching? Completeness, Determinism, Atomicity, or Schema fidelity? (Often more than one.)
What's the failure mode if I'm wrong? Silent corruption is the worst case — flag any change whose worst case is "downstream training silently degrades."
Is there an existing pattern? The recorder / writer family already solves the common cases. Extend rather than re-implementing.
What's the test plan? Can I run an episode and diff its output against a known-good baseline? If not, a code-only review is insufficient — flag this in the PR description.
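
For the baseline-diff question, a byte-level comparison is often enough to start with. The helper below is a hypothetical sketch, not an existing engine tool. It reads one byte at a time for clarity, so it is slow on large recordings.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>

// Returns the byte offset of the first difference between two files, or
// std::nullopt when they are byte-identical. A length mismatch is reported at
// the offset where the shorter file ends.
inline std::optional<std::uintmax_t> FirstDifference(
    const std::filesystem::path& baseline,
    const std::filesystem::path& candidate) {
    std::ifstream a(baseline, std::ios::binary);
    std::ifstream b(candidate, std::ios::binary);
    if (!a || !b) {
        throw std::runtime_error("cannot open recording for comparison");
    }
    for (std::uintmax_t offset = 0;; ++offset) {
        const int byte_a = a.get();  // traits eof() once the file is exhausted
        const int byte_b = b.get();
        if (byte_a != byte_b) {
            return offset;
        }
        if (byte_a == std::char_traits<char>::eof()) {
            return std::nullopt;  // both files ended at the same offset
        }
    }
}
```

A byte-identical result is only expected for runs that fall under the determinism contract described earlier; outside it, a field-aware comparison is the better tool.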

Don't bundle

If a fix here requires a schema change, version bump, or migration path, that's a separate task. Don't bundle it into an unrelated change.

Related

Data / Observation — the engine-side schema, observer, and recorder family.
Threading — TimeManager phases, worker rules, gRPC marshalling.
.claude/docs/RecordingIntegrity.md — canonical source.