Predict Samples¶
Seven executables demonstrating fingerprint-guided prediction:
- predict_params -- Parameter prediction from structured CSV (GPU kernel tuning, configuration databases).
- predict_sunspots -- Timeseries forecasting via sliding-window kNN (monthly sunspot numbers, 1749-present).
- predict_earthquakes -- Spatial prediction of earthquake magnitude from location and depth (USGS catalog).
- predict_discharge -- River discharge forecasting via sliding-window kNN with day-of-year seasonality (USGS NWIS daily streamflow).
- predict_soi -- Southern Oscillation Index prediction from anti-correlated Tahiti/Darwin sea level pressure using SPREAD decomposition (sum/diff modes).
- predict_stock -- Multi-stock timeseries prediction from CSV using SPREAD decomposition on two correlated price series (PCA for three or more).
- predict_crystal -- Crystal system classification from composition features (AFLOW ICSD, 7 classes, 60K entries).
Build¶
make
Or from the LIBXS root:
make GNU=1 samples/predict
predict_params¶
Train a prediction model from a CSV file and save it for later use. Finds the optimal training fraction and polynomial order automatically. Reports validation quality on a held-out subset.
Usage¶
./predict_params.x [fraction] [auto|cat|interp] [-N] <csvfile> [modelfile]
fraction Validation split 0..1 for quality report (default: 0.8).
The full model always trains on all entries.
auto Auto-detect mode per output (default).
cat Force categorical (kNN) for all outputs.
interp Force interpolation for all outputs.
-N Max polynomial order for final build (default: 0 = auto).
csvfile Delimited text file (semicolons, commas, or tabs).
The first line may be a header (auto-skipped if non-numeric).
modelfile Output path for the binary model.
Default: derived from CSV basename (e.g., data.csv -> data.bin).
Example¶
./predict_params.x ../../samples/smm/params/tune_multiply_PVC.csv
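The input rules above (semicolon, comma, or tab delimiters; a non-numeric first line treated as a header) can be illustrated with a minimal Python sketch. This is not the sample's actual parser; the helper name `read_numeric_rows` and the column names in the toy input are hypothetical.

```python
import csv
import io

def read_numeric_rows(text):
    """Parse delimited text and skip a non-numeric first line (assumed header)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    # Pick whichever candidate delimiter occurs most often in the first line.
    delim = max(";,\t", key=lines[0].count)
    rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=delim))

    def is_numeric(row):
        try:
            [float(cell) for cell in row]
            return True
        except ValueError:
            return False

    if not is_numeric(rows[0]):
        rows = rows[1:]  # first line is a header
    return [[float(cell) for cell in row] for row in rows]

# Hypothetical column names and values, for illustration only.
sample = "M;N;K;BM\n16;16;16;4\n32;32;32;8\n"
print(read_numeric_rows(sample))
```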
predict_sunspots¶
Timeseries forecasting using sliding-window nearest-neighbor prediction. The recent history (window of W values) serves as input; the next H values are predicted as output. The kNN confidence indicates whether similar patterns were seen in training.
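The idea of sliding-window nearest-neighbor forecasting can be sketched in a few lines of plain Python. This is an illustration of the general technique on a synthetic periodic series, not the libxs_predict implementation; `make_windows` and `knn_forecast` are hypothetical helper names, and the unweighted average over k neighbors is an assumption.

```python
import math

def make_windows(series, window, horizon):
    """Return (inputs, outputs) pairs: `window` past values -> next `horizon` values."""
    pairs = []
    for i in range(len(series) - window - horizon + 1):
        inputs = series[i:i + window]
        outputs = series[i + window:i + window + horizon]
        pairs.append((inputs, outputs))
    return pairs

def knn_forecast(pairs, query, k=3):
    """Average the continuations of the k training windows closest to `query`."""
    nearest = sorted(pairs, key=lambda p: math.dist(p[0], query))[:k]
    horizon = len(nearest[0][1])
    return [sum(p[1][h] for p in nearest) / len(nearest) for h in range(horizon)]

# Synthetic series with period 20 standing in for real data.
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]
pairs = make_windows(series[:160], window=12, horizon=6)   # training part
query = series[160:172]                                    # most recent window
forecast = knn_forecast(pairs, query)
actual = series[172:178]
print(max(abs(f - a) for f, a in zip(forecast, actual)))
```

Because the toy series repeats exactly, the nearest historical windows match the query almost perfectly and the forecast error is negligible; real data such as sunspot numbers will not behave this cleanly.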
Usage¶
./predict_sunspots.x <csvfile> [train_fraction]
csvfile Semicolon-delimited timeseries (SILSO sunspot format).
train_fraction Fraction of data used for training (default: 0.8).
Example¶
./predict_sunspots.x predict_sunspots.csv 0.8
Loaded 3328 monthly sunspot values from predict_sunspots.csv
Window=12, Horizon=6, Train=2650, Test=666
Built: 51 clusters, 32.0x compression, order=2
Forecast quality (661 test windows):
step avg-err max-err
t+1 17.58 88.10
t+2 19.48 115.00
t+3 21.01 107.10
t+4 21.76 114.50
t+5 22.84 118.40
t+6 24.26 153.20
avg confidence: 1.000
Data Source¶
Monthly mean total sunspot number from SILSO (World Data Center, Royal Observatory of Belgium). Semicolon-delimited: year, month, decimal_year, sunspot_number, std_dev, obs_count, marker. "Source: WDC-SILSO, Royal Observatory of Belgium, Brussels"
predict_earthquakes¶
Predict earthquake magnitude from geographic location and depth. This is a spatial prediction problem (not timeseries): given where an earthquake occurs, what magnitude is expected based on historical patterns at nearby locations?
Usage¶
./predict_earthquakes.x <usgs_csv> [train_fraction]
usgs_csv USGS earthquake catalog CSV (comma-delimited).
train_fraction Fraction of data for training (default: 0.8).
Example¶
./predict_earthquakes.x predict_earthquakes.csv
Loaded 19619 earthquake events from predict_earthquakes.csv
Inputs: latitude, longitude, depth -> Output: magnitude
Train=15695, Test=3924
Built: 125 clusters, 83.9x compression, order=2
Prediction quality (3924 test events):
avg magnitude error: 0.272
max magnitude error: 2.700
avg confidence: 0.649
Data Source¶
USGS Earthquake Hazards Program (public domain, US Government). Comma-delimited: time, latitude, longitude, depth, mag, magType, ...
predict_discharge¶
River discharge (streamflow) forecasting using sliding-window kNN with day-of-year as an additional input dimension to capture seasonality. Uses log-transform on outputs (via API) for heavy-tailed data. Predicts the next 7 days from the previous 14 days + rate-of-change derivatives.
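A rough sketch of how such an input vector could be assembled is shown below: a 14-day window, three recent day-to-day differences, and the day of year as one extra dimension, with log-transformed targets. The helper `build_sample`, the exact definition of the differences, and the use of `log1p`/`expm1` are assumptions for illustration; the sample itself delegates the transform to the library API.

```python
import math
from datetime import date

WINDOW, NDIFFS, HORIZON = 14, 3, 7

def build_sample(values, dates, t):
    """Build one (inputs, outputs) pair whose history ends at index t."""
    history = values[t - WINDOW + 1:t + 1]
    # Day-to-day changes over the most recent NDIFFS steps (rate of change).
    diffs = [history[-i] - history[-i - 1] for i in range(NDIFFS, 0, -1)]
    day_of_year = dates[t].timetuple().tm_yday
    inputs = history + diffs + [float(day_of_year)]
    # Log-transform the targets to tame the heavy tail; invert with expm1.
    outputs = [math.log1p(v) for v in values[t + 1:t + 1 + HORIZON]]
    return inputs, outputs

# Made-up values, not real streamflow data.
values = [1000.0 + 10.0 * i for i in range(30)]
dates = [date.fromordinal(date(2020, 1, 1).toordinal() + i) for i in range(30)]
inputs, outputs = build_sample(values, dates, t=13)
print(len(inputs), len(outputs))
print([round(math.expm1(y)) for y in outputs])
```

The input has 14 + 3 + 1 = 18 dimensions, and applying `expm1` to a predicted output recovers a discharge value in the original units.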
Usage¶
./predict_discharge.x <discharge_tsv> [train_fraction]
discharge_tsv USGS NWIS daily discharge (tab-delimited RDB format).
train_fraction Fraction of data for training (default: 0.8).
Example¶
./predict_discharge.x predict_discharge.tsv
Loaded 9135 daily discharge values from predict_discharge.tsv
Window=14 (+3 diffs +day-of-year), Horizon=7, Train=7294, Test=1827
Built: 85 clusters, 58.4x compression, order=2
Forecast quality (1821 test windows):
step avg-err max-err
t+1 644.5 17726.6
t+2 769.0 21363.5
t+3 877.9 23199.8
t+4 963.7 24193.4
t+5 1044.1 25086.3
t+6 1131.3 25735.3
t+7 1225.9 26729.3
avg confidence: 1.000
Data Source¶
USGS National Water Information System (public domain, US Government). Colorado River at Lees Ferry, site 09380000. Tab-delimited RDB, comment lines start with #, data columns: agency_cd, site_no, datetime, discharge_value, qualification_code.
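Reading such a file can be sketched as follows. The parser skips comment lines and any row whose discharge field is not numeric, which also drops the column-name row and missing values. The excerpt is made up for illustration, and the field-format row (`5s 15s ...`) is an assumption about RDB layout rather than something stated above; `parse_rdb` is a hypothetical helper, not part of the sample.

```python
def parse_rdb(text, value_column=3):
    """Return (datetime, value) pairs from tab-delimited RDB text."""
    records = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # comment or blank line
        fields = line.split("\t")
        if len(fields) <= value_column:
            continue  # too few columns to hold a discharge value
        try:
            value = float(fields[value_column])
        except ValueError:
            continue  # column-name row, format row, or missing value
        records.append((fields[2], value))
    return records

# Made-up excerpt, not real measurements.
sample = (
    "# made-up excerpt for illustration\n"
    "agency_cd\tsite_no\tdatetime\tdischarge_value\tqualification_code\n"
    "5s\t15s\t20d\t14n\t10s\n"
    "USGS\t09380000\t2000-01-01\t12300\tA\n"
    "USGS\t09380000\t2000-01-02\t\tA\n"
    "USGS\t09380000\t2000-01-03\t11800\tA\n"
)
print(parse_rdb(sample))
```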
predict_soi¶
Southern Oscillation Index prediction from anti-correlated sea level
pressure at Tahiti and Darwin. Demonstrates cross-series decomposition
via libxs_predict_set_decompose(LIBXS_PREDICT_SPREAD): the sum/diff
modes separate the common trend from the anti-correlated signal,
making the spread (which is the SOI) easier to predict.
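The sum/diff idea can be shown on toy numbers. In this sketch two series share a common trend and carry an equal and opposite signal; the sum isolates the trend and the difference isolates the signal. The function names and the unscaled sum/difference are assumptions for illustration, since the exact normalization used by LIBXS_PREDICT_SPREAD is not described here.

```python
def spread_decompose(a, b):
    """Map two aligned windows to (sum, diff) modes."""
    total = [x + y for x, y in zip(a, b)]
    spread = [x - y for x, y in zip(a, b)]
    return total, spread

def spread_recompose(total, spread):
    """Invert spread_decompose and recover the two original windows."""
    a = [(s + d) / 2 for s, d in zip(total, spread)]
    b = [(s - d) / 2 for s, d in zip(total, spread)]
    return a, b

# Toy data: common trend plus an anti-correlated signal (made-up values).
trend = [10.0, 10.5, 11.0, 11.5]
signal = [1.0, -2.0, 0.5, 1.5]
tahiti = [t + s for t, s in zip(trend, signal)]
darwin = [t - s for t, s in zip(trend, signal)]

total, spread = spread_decompose(tahiti, darwin)
print(total)   # twice the common trend
print(spread)  # twice the anti-correlated signal
```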
Usage¶
./predict_soi.x <tahiti_file> <darwin_file> [train_fraction]
tahiti_file NOAA CPC monthly Tahiti SLP (fixed-width).
darwin_file NOAA CPC monthly Darwin SLP (fixed-width).
train_fraction Fraction of data for training (default: 0.8).
Example¶
./predict_soi.x predict_soi_tahiti.dat predict_soi_darwin.dat
Data Source¶
NOAA Climate Prediction Center (public domain, US Government).
- Tahiti: https://www.cpc.ncep.noaa.gov/data/indices/tahiti
- Darwin: https://www.cpc.ncep.noaa.gov/data/indices/darwin
Fixed-width: YEAR followed by 12 monthly sea level pressure values (mb above 1000 mb). Data from 1951 to present.
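A minimal reader for this layout might look like the sketch below. It assumes the fields are separated by whitespace and that every data row holds exactly a year plus 12 values; rows that do not fit (titles, partial years) are skipped. `parse_monthly` is a hypothetical helper and the numbers are made up.

```python
def parse_monthly(text):
    """Flatten YEAR + 12 monthly values per row into one chronological list."""
    series = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 13 or not fields[0].isdigit():
            continue  # not a complete data row
        series.extend(float(v) for v in fields[1:])
    return series

# Made-up values for illustration only.
sample = (
    "TITLE LINE\n"
    "1951  10.5 11.2 12.0 11.8 12.9 13.1 14.0 14.2 13.5 12.8 11.9 10.7\n"
    "1952  10.9 11.0 12.3 12.1 13.0 13.4 14.1 14.4 13.2 12.5 11.6 10.8\n"
)
values = parse_monthly(sample)
print(len(values), values[0], values[-1])
```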
predict_stock¶
Multi-stock timeseries prediction from a CSV file containing one
or more correlated price series. Uses libxs_predict_set_diff(model, 0)
for automatic differencing (non-stationary data) and PCA decomposition
for 3+ series (SPREAD for 2). Compares against a raw concatenation
baseline and a single-series baseline to show whether the cross-series
signal improves forecasts.
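Differencing and its inverse can be sketched as follows. This only illustrates the transform, not the library's detection of non-stationarity; `difference` and `integrate` are hypothetical helper names and the prices are made up.

```python
from itertools import accumulate

def difference(prices):
    """First differences: price[t+1] - price[t]."""
    return [b - a for a, b in zip(prices, prices[1:])]

def integrate(last_price, predicted_diffs):
    """Turn predicted differences back into absolute prices."""
    return list(accumulate(predicted_diffs, initial=last_price))[1:]

prices = [100.0, 101.5, 101.0, 103.0]
diffs = difference(prices)
print(diffs)
print(integrate(prices[0], diffs))
```

Integrating the differences from the first price reproduces the remaining prices, which is what allows a model trained on differences to report forecasts in absolute terms.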
Usage¶
./predict_stock.x <csv_file> [columns] [train_fraction]
csv_file Comma-delimited CSV with header row.
columns Comma-separated 0-based column indices (default: 1,2).
train_fraction Fraction of data for training (default: 0.8).
Example¶
./predict_stock.x stocks.csv 1,2,3
Data Format¶
CSV with a header line; numeric columns selected by index:
Date,Stock1,Stock2
2007-01-03,19.520000457763672,12.165162086486816
2007-01-04,19.790000915527344,12.655348777770996
...
Data Source¶
Historical stock price CSV from Bulk Stock Data Downloader.
predict_crystal¶
Crystal system prediction (7-class classification) from chemical
composition features. Demonstrates LIBXS_PREDICT_FISHER for
automatic feature weighting via Fisher's discriminant criterion,
or LIBXS_PREDICT_RF for Random Forest classification.
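One common form of Fisher's criterion scores each feature by the ratio of between-class to within-class scatter, so features that separate the classes well get larger weights. The sketch below computes that ratio on toy data; whether LIBXS_PREDICT_FISHER uses exactly this formula is an assumption, and `fisher_scores` is a hypothetical helper.

```python
from collections import defaultdict

def fisher_scores(rows, labels):
    """Per-feature ratio of between-class to within-class scatter."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        mean = sum(column) / len(column)
        between = within = 0.0
        for members in by_class.values():
            vals = [m[j] for m in members]
            cmean = sum(vals) / len(vals)
            between += len(vals) * (cmean - mean) ** 2
            within += sum((v - cmean) ** 2 for v in vals)
        scores.append(between / within if within > 0 else 0.0)
    return scores

# Toy data: feature 0 separates the two classes, feature 1 does not.
rows = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.0], [3.2, 2.0]]
labels = [1, 1, 2, 2]
scores = fisher_scores(rows, labels)
print(scores)
```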
Usage¶
./predict_crystal.x <crystal_csv> [train_fraction] [order] [nclusters]
crystal_csv CSV with composition features + crystal_system label.
train_fraction Fraction of data for training (default: 0.8).
Example¶
./predict_crystal.x predict_crystal.csv
Loaded 60386 entries (37 features) from predict_crystal.csv
Train=48309, Test=12077
Built: 220 clusters, 208.7x compression, order=2
Accuracy: 9625/12077 = 79.7%
Confidence-gated (>=0.9): 6167/6502 = 94.8% (coverage 53.8%)
Avg confidence: 0.820
Data Source¶
AFLOW ICSD catalog (free for academic use).
60,386 entries with Magpie-style composition features (7 elemental
properties x 5 statistics + 2 counts = 37 features). Crystal systems:
triclinic(1), monoclinic(2), orthorhombic(3), tetragonal(4),
trigonal(5), hexagonal(6), cubic(7).
Data preparation: prepare_crystal.py.
How It Works¶
All samples share the same prediction library (libxs_predict):
- predict_params: Each CSV row is an independent (inputs, outputs) pair. The model learns spatial relationships in the input space and predicts outputs for unseen input combinations.
- predict_sunspots: Uses libxs_predict_set_series to declare timeseries structure; the framework constructs sliding windows internally from accumulated timesteps. The model finds historically similar windows and predicts the continuation. The kNN confidence reflects how well the current pattern matches training history.
- predict_earthquakes: Each earthquake event provides (lat, lon, depth) as inputs and magnitude as output. The model finds geographically similar past events and predicts expected magnitude for new locations.
- predict_discharge: Combines temporal sliding-window (14 days) with day-of-year seasonality as an extra input dimension. Log-transform on outputs (via libxs_predict_set_transform) handles heavy-tailed discharge data transparently.
- predict_soi: Two anti-correlated series (Tahiti and Darwin sea level pressure) feed a single model via set_series(2, W). SPREAD decomposition transforms the stacked windows into sum/diff modes before kNN matching, exploiting the anti-correlation structure that defines the Southern Oscillation Index.
- predict_stock: Two stock price series loaded via libxs_predict_load_csv with numeric column indices. libxs_predict_set_diff(model, 0) auto-detects that stock prices are non-stationary and differences them before window construction; at eval, predicted differences are integrated back to absolute prices. SPREAD decomposition separates sum (market trend) from diff (relative strength), making correlated equities easier to predict.
- predict_crystal: 37 composition features predict one of 7 crystal systems. Uses LIBXS_PREDICT_RF (Random Forest) for 79.7% accuracy, or LIBXS_PREDICT_FISHER (kNN with automatic feature weighting) for 70.7%. Confidence gating raises accuracy to about 95% on the reliable subset (94.8% at confidence >= 0.9 in the example above).
The fingerprint automatically determines per-output whether polynomial interpolation or distance-weighted kNN voting is more appropriate. Per-output confidence and variance scores enable the caller to gate predictions and fall back to safe defaults when the model is uncertain.
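Gating on confidence can be sketched independently of the library. Below, each prediction is a (value, confidence) pair; predictions under the threshold are replaced by a caller-chosen fallback, and accuracy is reported together with coverage (the share of predictions that pass the gate). The function names and toy data are hypothetical.

```python
def gate(predictions, threshold, fallback):
    """Keep a prediction if its confidence meets the threshold, else use fallback."""
    return [value if conf >= threshold else fallback
            for value, conf in predictions]

def gated_accuracy(predictions, truths, threshold):
    """Accuracy on the confident subset, plus the fraction of cases kept."""
    kept = [(v, t) for (v, c), t in zip(predictions, truths) if c >= threshold]
    coverage = len(kept) / len(predictions)
    accuracy = sum(v == t for v, t in kept) / len(kept) if kept else 0.0
    return accuracy, coverage

# Toy (predicted class, confidence) pairs and true classes.
predictions = [(3, 0.95), (7, 0.60), (2, 0.92), (5, 0.91)]
truths = [3, 4, 2, 6]
print(gate(predictions, 0.9, fallback=None))
print(gated_accuracy(predictions, truths, 0.9))
```

The crystal example reports the same two quantities: 6167 correct out of 6502 confident predictions (94.8%), with 6502 of 12077 test entries passing the gate (53.8% coverage).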
When confidence is low (below 0.7), the framework automatically expands to multi-cluster blending, which improves predictions by averaging over distinct regimes (e.g., about 4% lower MAE on earthquake magnitude prediction).
Timeseries samples use LIBXS_PREDICT_TEMPORAL mode which enables
recency weighting (recent neighbors preferred), continuous-valued
output without snap-to-nearest discretization, and local coherence
smoothing across horizon steps. These heuristics also auto-enable
for any timeseries model (nseries > 0) when the query falls outside
the training bounding box.
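Recency weighting can be pictured as a weighted average in which a neighbor's weight falls with both its distance from the query and its age. The sketch below uses an inverse-distance term and an exponential half-life; both choices, and the helper name, are assumptions made for illustration and are not taken from the library.

```python
def recency_weighted_average(neighbors, now, half_life):
    """neighbors: (distance, timestamp, value) tuples; closer and newer count more."""
    num = den = 0.0
    for distance, timestamp, value in neighbors:
        closeness = 1.0 / (1.0 + distance)
        freshness = 0.5 ** ((now - timestamp) / half_life)
        weight = closeness * freshness
        num += weight * value
        den += weight
    return num / den

# Two equally close neighbors; the newer one (timestamp 10) gets twice the weight.
neighbors = [(0.0, 0, 0.0), (0.0, 10, 3.0)]
result = recency_weighted_average(neighbors, now=10, half_life=10)
print(result)
```

Since the result is a weighted mean rather than the value of a single neighbor, the output is continuous-valued instead of snapping to the nearest stored sample.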
The sunspot sample uses libxs_predict_set_series to declare
timeseries structure: instead of manually constructing sliding windows,
the caller pushes one timestep at a time (with outputs=NULL) and the
framework builds all valid windows at build time. For multiple
co-observed series, libxs_predict_set_decompose(LIBXS_PREDICT_SPREAD)
transforms the stacked windows into sum/diff modes, exploiting
anti-correlation between series (e.g., one's gain is the other's loss).
The sunspot sample additionally demonstrates forward-inverse-forward
iteration: predicting outputs, finding the canonical historical window
via libxs_predict_inverse, then re-predicting from that pattern.
This reduces worst-case errors (17% lower max error) at the cost of a
slightly higher average error -- a bias-variance tradeoff useful for
applications where avoiding catastrophic predictions matters most.
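The mechanics of forward-inverse-forward iteration can be sketched with a toy kNN model. Here the inverse step is taken to mean "return the stored input window whose outputs lie closest to the predicted outputs"; that reading, the helper names, and the data are assumptions for illustration and do not describe how libxs_predict_inverse works internally.

```python
import math

def forward(pairs, query, k=3):
    """kNN forward pass: average the outputs of the k nearest input windows."""
    nearest = sorted(pairs, key=lambda p: math.dist(p[0], query))[:k]
    n = len(nearest[0][1])
    return [sum(p[1][i] for p in nearest) / len(nearest) for i in range(n)]

def inverse(pairs, outputs):
    """Inverse pass: the stored input window whose outputs best match `outputs`."""
    return min(pairs, key=lambda p: math.dist(p[1], outputs))[0]

def forward_inverse_forward(pairs, query, k=3):
    first = forward(pairs, query, k)      # initial prediction
    canonical = inverse(pairs, first)     # closest historical pattern
    return forward(pairs, canonical, k)   # re-predict from that pattern

# Toy (window, continuation) pairs.
pairs = [
    ([0.0, 1.0], [2.0]),
    ([1.0, 2.0], [3.0]),
    ([5.0, 6.0], [7.0]),
    ([6.0, 7.0], [8.0]),
]
query = [2.0, 3.0]
print(forward(pairs, query))
print(inverse(pairs, forward(pairs, query)))
print(forward_inverse_forward(pairs, query))
```

Re-predicting from a window that was actually observed anchors the result to known history, which is one way to read the reduction in worst-case error described above.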