# Practical Ozaki Schemes

## High-Precision GEMM via Low-Precision Hardware

LIBXS and LIBXSTREAM

---

## Latest Revision of the Slides

https://libxs.readthedocs.io/ozaki/

---

## The Transistor Cost of FP64

| Unit     | Transistors | Relative |
|----------|-------------|----------|
| FP64 FMA | ~150k       | 1×       |
| FP16 FMA | ~15k        | 0.1×     |
| INT8 MAC | ~5k         | 0.03×    |

Same die area: ~30× more INT8 MACs than FP64 FMAs.

---

## Fairness: Matrix Engines

- **Fair**: tensor core INT8 vs. tensor core FP64
  (both dense matrix engines, same data-flow model)
- **Unfair**: tensor core INT8 vs. scalar FP64 ALU
  (scalar unit: register-file and scheduling overhead)

Ozaki targets matrix engines: VNNI/AMX on CPU, DPAS/tensor cores on GPU.

---

## Hardware Targets

| Target        | Instruction  | Operands | Status  |
|---------------|--------------|----------|---------|
| AVX-512 VNNI  | VPDPBUSD     | u8 × s8  | default |
| AVX-512 INT8  | VPDPBUUD     | u8 × u8  | auto    |
| Intel Xe DPAS | dpas.8×8     | u8/s8    | default |
| NVIDIA        | dp4a         | u8/s8    | default |
| NVIDIA        | MMA m16n8k32 | u8/s8    | auto    |

CPU selects VNNI/AMX automatically. NVIDIA: dp4a reuses the DPAS-style
layout; MMA uses m16n8k32. Both are mainly limited by 2D block loads.

---

## What Is Ozaki?

Replaces BLAS GEMM with low-precision arithmetic.

- **Scheme 1**: Mantissa slicing (int8) — quadratic in slices
- **Scheme 2**: Chinese Remainder Theorem (CRT) — linear in primes
- **Adaptive**: Automatic selection based on data characteristics

Built-in accuracy tracking tells if results are trustworthy.

---

## How It Works

```text
Application calls DGEMM/SGEMM/ZGEMM/CGEMM
            │
            ▼
    libwrap.so intercepts the call (LD_PRELOAD)
            │
            ▼
    Decompose FP values into int8 slices or u8 residues
            │
            ▼
    Low-precision dot products (VNNI on CPU, DPAS on GPU)
            │
            ▼
    Reconstruct full-precision result
```

---

## Build

```bash
git clone https://github.com/hfp/libxstream.git
git clone https://github.com/hfp/libxs.git

cd libxs/samples/ozaki
make GNU=1 -j $(nproc)
```

Produces `libwrap.so`, `libwrap.a`, and test drivers
(`dgemm-wrap.x`, `sgemm-wrap.x`, `zgemm-wrap.x`, `cgemm-wrap.x`).

---

## Quick Sanity Check

```bash
./dgemm-wrap.x 256
```

```text
dgemm('N', 'N', 256/*m*/, 256/*n*/, 256/*k*/,
  1/*alpha*/, 0x7cf7c8a02010/*a*/, 256/*lda*/,
              0x7cf7c8981010/*b*/, 256/*ldb*/,
   1/*beta*/, 0x7cf7c8900010/*c*/, 256/*ldc*/)
OZAKI GEMM: 27.3 ms (1.2 GFLOPS/s)
BLAS GEMM:  0.2 ms (214.3 GFLOPS/s)
...
```

---

## Deploy via LD_PRELOAD

```bash
LD_PRELOAD=/path/to/libwrap.so ./your_application
```

DGEMM/SGEMM/ZGEMM/CGEMM calls are intercepted automatically.

No recompilation needed.

---

## MPI: Getting LD_PRELOAD Right

The wrapper must reach every MPI rank, not just mpirun itself.

### Wrong (wrapper only in the launcher process)

```bash
LD_PRELOAD=libwrap.so mpirun -np 4 ./app
```

### Correct (wrapper propagated to each rank)

```bash
mpirun -np 4 env LD_PRELOAD=/path/to/libwrap.so ./app
```

---

## SLURM and MPI

- **SLURM**: `srun -n 4 --export=ALL,LD_PRELOAD=/path/to/libwrap.so ./app`
- **OpenMPI**: `mpirun -np 4 -x LD_PRELOAD=/path/to/libwrap.so ./app`
- **Intel MPI**: `mpiexec -n 4 -genv LD_PRELOAD /path/to/libwrap.so ./app`

---

## Validate and Check Accuracy

Before production, enable statistics and set a quality threshold:

```bash
export OZAKI_VERBOSE=1      # print summary at exit
export OZAKI_RSQ=0.95       # dump matrices if R² drops below 0.95
```

Run your workload and check the exit summary.

```text
DGEMM[3|410814]: linf=7.81597e-14 linf_rel=2.23531e-15 l2_rel=6.0263e-13 eps=5.20675e-16 rsq=1
```

In brackets are the call-count followed by the PID.

---

## Reading the Statistics

| Metric   | What it tells you       | Healthy value |
|----------|-------------------------|---------------|
| linf     | Max abs. error          | ≈0            |
| linf_rel | Max rel. error          | <1e-10 (FP64) |
| l2_rel   | RMS rel. error          | <1e-12 (FP64) |
| eps      | Frobenius               | <1e-10 (FP64) |
| rsq      | R² (best single metric) | 1.0 = perfect |

R² is the important number. Investigate if below 0.99.

---

## Monitoring Jobs

For HPC jobs that run hours or days, track accuracy over time:

```bash
export OZAKI_VERBOSE=1000   # print stats every 1000th GEMM
export OZAKI_RSQ=0.9        # auto-dump problematic matrices
export OZAKI_EXIT=0         # keep running after threshold violation
```

Watch for R² degradation — it indicates ill-conditioned matrices
or insufficient decomposition depth.

---

## Debugging Accuracy

When `OZAKI_RSQ` or `OZAKI_EPS` thresholds are exceeded,
A and B matrices are dumped as MHD or PNG files (viewers available).

Reproduce the problem offline:

```bash
./dgemm-wrap.x gemm-292284-0-500-a.mhd gemm-292284-0-500-b.mhd
```

- Dump GEMM call by counter using `OZAKI_DUMP`.
- Try to increase the accuracy; see [Details](https://libxs.readthedocs.io/samples/libxs_ozaki/).

---

## Scheme Selection

| Setting | Cost model              | Best for                              |
|---------|-------------------------|---------------------------------------|
| OZAKI=2 | P integer GEMMs (fixed) | General use, large matrices (default) |
| OZAKI=1 | S·(S+1)/2 integer GEMMs | Narrow exponent spans                 |
| OZAKI=3 | Auto-selects 1 or 2     | Repeated calls, unknown data          |

P = number of primes (default 16 for FP64).
S = number of slices (default 8 for FP64).

Auto-selection only on GPU (Ozaki-2 on CPU).

---

## Tuning: Accuracy vs Speed

```bash
export OZAKI_TRIM=7         # drop seven precision levels (faster)
export OZAKI_THRESHOLD=0    # apply Ozaki to ALL GEMMs (default: 12)
export OZAKI_N=12           # more slices/primes (more accurate)
```

The `OZAKI_THRESHOLD` controls minimum arithmetic intensity.
GEMMs below the threshold fall through to the original BLAS.

---

## GPU Offload (LIBXSTREAM)

When built with LIBXSTREAM, a GPU path is available:

```bash
export OZAKI_OCL=1          # enable GPU offload
export OZAKI_OCL=0          # CPU only (default)

export OZAKI=1              # Adaptive cut-off
export OZAKI=2              # CRT scheme (default)
export OZAKI=3              # Auto-select scheme
```

Auto-selection uses `OZAKI_CACHE` as a bonus.

---

## GPU Offload Cache

Avoid to re-transfer the A-matrix or the B-matrix.

```bash
export OZAKI_CACHE=1        # A-matrix
export OZAKI_CACHE=2        # B-matrix
export OZAKI_CACHE=3        # A and B
```

The content is finger-printed,
but can be still wrong!

---

## Profiling

```bash
export OZAKI_PROFILE=1      # all phases (preprocessing + kernel)
export OZAKI_PROFILE=2      # kernel only (dot/matrix products)
export OZAKI_PROFILE=3      # preprocessing only
```

At program exit:

```text
OZAKI PROF: 850 DP-GFLOPS/s (17.0 INT8-TOPS/s, 20×)
```

Reports effective GFLOPS/s and derived INT8 throughput.
Same histogram for CPU and GPU.

---

## Complex GEMM

Complex GEMM is mapped to real GEMM internally.
The real GEMM is wrapped automatically.

```bash
export OZAKI_COMPLEX=0      # BLAS as linked
export OZAKI_COMPLEX=1      # CPU Ozaki
export OZAKI_COMPLEX=2      # GPU+fallback (default)
```

---

## Complete SLURM Example

```bash
#!/bin/bash
#SBATCH -o %x-%j.txt
#SBATCH -J yourapp
#SBATCH --time=01:00:00
#SBATCH --exclusive
#SBATCH --ntasks-per-node=16
#SBATCH -N 1

export LD_PRELOAD=$HOME/libxs/samples/ozaki/libwrap.so
export OZAKI=2              # CRT scheme (default)
export OZAKI_VERBOSE=1000   # monitor every 1000th GEMM
export OZAKI_RSQ=0.9        # dump if accuracy drops

srun --export=ALL ./yourapp.x workload.inp
```

---

## Troubleshooting

**Segfault on startup?**
The application may statically link BLAS.
Use --wrap instead:

```bash
gcc -o app app.o -lwrap \
  -Wl,--wrap=dgemm_ -Wl,--wrap=sgemm_ \
  -Wl,--wrap=zgemm_ -Wl,--wrap=cgemm_ \
  -llapack -lblas
```

**No Ozaki output at all?**
Verify the wrapper is loaded:

```bash
ldd ./app | grep libwrap
```

---

## Checklist

1. **Build**: `cd libxs/samples/ozaki && make GNU=1 -j`
2. **Deploy**: `mpirun -np N env LD_PRELOAD=./libwrap.so ./app`
3. **Validate**: `OZAKI_VERBOSE=1 OZAKI_RSQ=0.95`
4. **Monitor**: `OZAKI_VERBOSE=1000 OZAKI_EXIT=0`
5. **Debug**: `./dgemm-wrap.x dumped-a.mhd dumped-b.mhd`
6. **Tune**: `OZAKI=1 OZAKI_TRIM=7`

---

## The Quadratic Cost — Here or There?

Silicon, Ozaki-1, and Ozaki-2 all scale ~w² with precision width.
Ozaki shifts this cost from **transistors to algorithm** —
the ~30× unit cost margin (INT8 vs. FP64) absorbs it.

Only GEMM (O(n³) compute on O(n²) data) can pay this cost.
Reductions, triangular solves, element-wise ops — still need native FP64.

---

## Hardware Design Implications

Ozaki changes the hardware trade-off:

- **AI-dominated**: minimal FP64, strong LP — viable for HPC GEMM via reconstruction
- **Balanced**: moderate FP64 + LP — how about CPU?
- **HPC-dominated**: wide FP64 — solver-heavy, etc.

FP64 stagnation already due to AI dominance.

---

## When LP Wins & When Not

- Favorable: compute-bound GEMM, LP:DP absorbs overhead (16–36×)
- Unfavorable: small matrices, bandwidth-bound, low LP:DP ratio
- Structural: only GEMM — nothing else has the compute density

---

## Questions?

- Hans Pabst (hans.pabst @ intel.com)
- https://github.com/hfp/libxs
- https://github.com/hfp/libxstream