Rosetta¶

Demonstrates hierarchical type discovery on opaque binary data. A table of harmonic-series values is encoded as a flat byte blob with no metadata. The analysis rediscovers the element type, record stride, and per-field structure -- recovering mathematical content from anonymous bytes.

What It Shows¶

The sample proceeds through the levels described in the algebra paper (Section "Hierarchical Type Discovery"):

Step	Tool used	What it finds
Flat probe	libxs_fprint (UNKNOWN)	inconclusive (expected)
Stride sweep	libxs_fprint per-column	f64, stride=6
Shuffle test	libxs_shuffle + fprint	order matters (3-4x)
Field analysis	libxs_fprint per-field	decay rates per column
Sort test	libxs_sort_smooth (GREEDY)	already optimally ordered
Verification	libxs_setdiff_min	0 unmatched, tol=0

Key observations from the output:

The flat 1-D probe fails because interleaved fields of different scales (1/n mixed with n mixed with ln(n)) look like noise when read sequentially. This motivates the stride sweep.
The stride sweep requires ALL columns at a candidate width to have decay < 1 before accepting it, which eliminates false positives from partial correlations at wrong strides.
Shuffle stability confirms that the record order carries genuine structure (decay increases ~4x after coprime permutation).
Per-field analysis reveals:
- [0] decay=0 -- perfect ramp (sequential index)
- [1] decay~0.63 -- 1/n (decaying but not ultra-smooth)
- [2] decay~0.20 -- H_n (partial sums, very smooth)
- [5] decay~0.29 -- converges to Euler-Mascheroni gamma
GREEDY sort confirms the data is already optimally ordered (the natural 1..64 sequence is the smoothest permutation).

The Data¶

For n = 1, 2, ..., 64 the table stores six f64 fields per record:

[0] n            integer index
[1] 1/n          harmonic term
[2] H_n          partial sum of the harmonic series
[3] ln(n)        natural logarithm
[4] H_n - ln(n)  difference (converges to gamma from above)
[5] H_n - ln(n) - 1/(2n)   better gamma estimate

Total: 64 records x 6 fields = 384 doubles = 3072 bytes.

Build¶

cd samples/rosetta
make GNU=1

Usage¶

./rosetta.x

No arguments. The sample is self-contained: it generates the data, encodes it as a flat blob, runs the full analysis, and prints the results.

To collect machine-readable artifacts, set ROSETTA_OUTDIR to an existing directory:

mkdir -p results/rosetta
ROSETTA_OUTDIR=results/rosetta ./rosetta.x

This writes summary.csv, rosetta_original.mhd, rosetta_shuffled.mhd, rosetta_fields.mhd, and rosetta_fields_shuffled.mhd. If LIBXS_MHD_PNG=1 is also set, the MHD writer emits PNG files with the same basenames. The fields images are normalized per field, making the ordered and shuffled record traces easy to compare visually. The artifact mode is optional and leaves the interactive sample output unchanged.

Example Output¶

ROSETTA: Recovering structure from opaque bytes
--------------------------------------------------
Ground truth (hidden from the analysis):
  64 records x 6 fields = 384 doubles (3072 bytes)
  Data: harmonic series H_n and derived quantities.
  Field [5] converges to Euler-Mascheroni gamma.
--------------------------------------------------
Level 0: Flat 1-D probe (all bytes as one sequence)
  Input: 3072 opaque bytes
  No type has decay < 1 -- flat stream is not smooth.
  This is expected: interleaved fields of different
  scales look like noise when read sequentially.
--------------------------------------------------
Stride sweep: discovering record layout
  Best type:   f64
  Best stride: 6 elements (48 bytes per record)
  Records:     64
  Avg decay:   0.288244
--------------------------------------------------
Shuffle stability (record-level)
  Avg decay (original): 0.288244
  Avg decay (shuffled): 1.103839
  Ratio:                3.8x
  -> Record order carries structure.
--------------------------------------------------
Per-field decay analysis
  [0] n           decay=0.000000
  [1] 1/n         decay=0.631512
  [2] H_n         decay=0.203226
  [3] ln(n)       decay=0.259621
  [4] H_n-ln(n)   decay=0.342853
  [5] gamma_est   decay=0.292251  (last=0.5771953203, gamma=0.5772156649)
  Smoothest: [0] n (perfect ramp, decay=0)
  -> This reveals a sequential index column.
  Most interesting: [5] gamma_est -- converges to a constant.
--------------------------------------------------
GREEDY sort test (64 rows x 6 cols)
  Data is already optimally ordered for row smoothness.
--------------------------------------------------
Verification: setdiff(original, blob)
  Unmatched: 0, tolerance: 0.00e+00
  -> Byte-perfect recovery.
--------------------------------------------------

Why This Is Interesting¶

From 3072 anonymous bytes the framework discovers:

The element type (f64) -- not i32, not f32, not i64.
The record structure (6-field records of 48 bytes each).
A sequential index column (field [0], decay = 0).
A column converging to Euler-Mascheroni gamma (field [5]).
That the natural ordering is already optimal (no resorting).

No metadata, no format knowledge, no human guidance. The hierarchical composition -- stride sweep with all-columns-must-pass filtering, shuffle stability, per-field decay ranking -- is what makes this possible. Any single tool alone would either fail (flat probe) or produce ambiguous results (stride sweep without the all-columns requirement accepts wrong strides).