
Quick Start — Evaluate Sentiment Classifiers on Beige Books

steps: 10 · runtime: 5–30 minutes (depending on hardware and remote API latency)

Reproduce the headline comparison from Smith & Lambert (2026): score the 200-chunk human-labelled test set with each model, compute accuracy / macro F1 / MCC, then check correlations against GDP growth, unemployment, and CPI. Assumes you have already trained or downloaded BeigeSage and have API access / local weights for the comparison LLMs. For end-to-end training-from-scratch, see the `reproduce_beigesage` playbook.

// pipeline
10 steps · DAG
01

Fetch labelled test set from OSF

action download
config (3 keys)
{
  "expected_columns": [
    "chunk_id",
    "text",
    "human_label",
    "human_score"
  ],
  "target": "./data/beige_book_test_200.csv",
  "url": "https://osf.io/xq35t/"
}
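Step 01 can be sketched in a few lines: download the CSV from the configured OSF URL and verify that all four expected columns are present before anything downstream runs. The helper names (`fetch_test_set`, `validate_columns`) are illustrative, not part of the pipeline's API.

```python
# Sketch of step 01: fetch the labelled test set and verify its schema.
import csv
import io
import urllib.request

EXPECTED_COLUMNS = ["chunk_id", "text", "human_label", "human_score"]

def validate_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the header row, raising if any expected column is missing."""
    header = next(csv.reader(io.StringIO(csv_text)))
    missing = [c for c in expected if c not in header]
    if missing:
        raise ValueError(f"test set is missing columns: {missing}")
    return header

def fetch_test_set(url, target):
    """Download the CSV, validate its header, and write it to disk."""
    with urllib.request.urlopen(url) as resp:  # network call
        text = resp.read().decode("utf-8")
    validate_columns(text)
    with open(target, "w", encoding="utf-8") as f:
        f.write(text)
    return target
```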
02

Data Quality Gate

action data_quality_gate · depends: fetch_test_set · on fail: abort
config (4 keys)
{
  "class_distribution_check": {
    "mixed": 0.415,
    "negative": 0.195,
    "positive": 0.39,
    "tolerance": 0.05
  },
  "constructs": [
    "beige_book_sentiment"
  ],
  "max_missing_pct": 0,
  "min_observations": 200
}
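The gate in step 02 amounts to three checks: at least 200 rows, no missing labels, and a class mix within ±0.05 of the published distribution. A minimal sketch (function name illustrative):

```python
# Sketch of step 02's gate: abort unless the labelled set matches the
# published size and class distribution.
from collections import Counter

EXPECTED_SHARES = {"mixed": 0.415, "positive": 0.39, "negative": 0.195}

def quality_gate(labels, min_obs=200, tolerance=0.05, expected=EXPECTED_SHARES):
    if len(labels) < min_obs:
        raise ValueError(f"only {len(labels)} observations, need {min_obs}")
    if any(l is None or l == "" for l in labels):
        raise ValueError("missing labels found (max_missing_pct is 0)")
    counts = Counter(labels)
    for cls, share in expected.items():
        observed = counts.get(cls, 0) / len(labels)
        if abs(observed - share) > tolerance:
            raise ValueError(f"{cls}: observed share {observed:.3f} "
                             f"outside {share}±{tolerance}")
    return True
```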
03

Score test set with BeigeSage

action classify · depends: data_quality_check
config (4 keys)
{
  "batch_size": 16,
  "framework": "huggingface_transformers",
  "max_length": 256,
  "model": "beigesage"
}
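The batching in step 03 can be sketched independently of the model. In the real pipeline the scoring callable would wrap a Hugging Face pipeline over the fine-tuned `beigesage` checkpoint (truncating to 256 subword tokens); here the model is an injected function and truncation uses a whitespace proxy, so the control flow stands alone.

```python
# Sketch of step 03: batch the 200 chunks and score them with an
# injected classifier callable (stand-in for a transformers pipeline).

def score_in_batches(texts, model_fn, batch_size=16, max_length=256):
    """Truncate each text to max_length whitespace tokens (the real
    tokenizer truncates subword tokens) and score in batches."""
    preds = []
    for start in range(0, len(texts), batch_size):
        batch = [" ".join(t.split()[:max_length])
                 for t in texts[start:start + batch_size]]
        preds.extend(model_fn(batch))
    return preds
```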
04

Score test set with zero-shot LLMs

action classify_zero_shot · depends: data_quality_check
config (5 keys)
{
  "models": [
    "gpt-4o",
    "llama-3.1-8b",
    "phi3-3.8b",
    "gemma2-9b"
  ],
  "output_format": "json",
  "parse_key": "label",
  "prompt_template_ref": "zero_shot_prompt",
  "retry_on_invalid_json": 3
}
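The response handling in step 04 is the fiddly part: parse each model's JSON reply, pull the `label` key, and retry up to 3 times on invalid output. The `llm_call` argument stands in for whichever client serves each model (OpenAI API, local weights, etc.); all names here are illustrative.

```python
# Sketch of step 04's reply handling: JSON parse + retry_on_invalid_json.
import json

VALID_LABELS = {"positive", "negative", "mixed"}

def classify_zero_shot(llm_call, prompt, parse_key="label", retries=3):
    last_err = None
    for _ in range(retries):
        raw = llm_call(prompt)
        try:
            label = json.loads(raw)[parse_key]
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            last_err = err
            continue
        if label in VALID_LABELS:
            return label
        last_err = ValueError(f"unexpected label: {label!r}")
    raise RuntimeError(f"no valid JSON label after {retries} tries: {last_err}")
```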
05

Score test set with VADER baseline

action classify_lexical · depends: data_quality_check
config (2 keys)
{
  "bin_thresholds": {
    "negative_below": -0.2,
    "positive_above": 0.2
  },
  "method": "vader_compound"
}
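Step 05's binning maps VADER's compound score in [-1, 1] onto the three classes using the config thresholds. Computing the compound score itself needs the `vaderSentiment` package (`SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`); only the binning is shown.

```python
# Sketch of step 05: bin VADER's compound score into the three classes.

def bin_compound(compound, negative_below=-0.2, positive_above=0.2):
    if compound < negative_below:
        return "negative"
    if compound > positive_above:
        return "positive"
    return "mixed"
```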
06

Per-Model Classification Metrics

engine classification_metrics · depends: score_beigesage, score_llms_zero_shot, score_vader
config (3 keys)
{
  "gold": "human_label",
  "metrics": [
    "accuracy",
    "macro_f1",
    "mcc",
    "per_class_precision",
    "per_class_recall"
  ],
  "predictions": [
    "beigesage_prediction",
    "gpt4o_prediction",
    "llama_prediction",
    "phi3_prediction",
    "gemma2_prediction",
    "vader_prediction"
  ]
}
expected results (6 keys)
{
  "beigesage": {
    "accuracy": 0.71,
    "macro_f1": 0.71,
    "mcc": 0.55
  },
  "gemma2": {
    "accuracy": 0.59,
    "macro_f1": 0.55,
    "mcc": 0.41
  },
  "gpt_4o": {
    "accuracy": 0.63,
    "macro_f1": 0.62,
    "mcc": 0.48
  },
  "llama_3_1_8b": {
    "accuracy": 0.68,
    "macro_f1": 0.67,
    "mcc": 0.52
  },
  "phi3": {
    "accuracy": 0.62,
    "macro_f1": 0.6,
    "mcc": 0.45
  },
  "vader": {
    "accuracy": 0.49,
    "macro_f1": 0.43,
    "mcc": 0.27
  }
}
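The two headline metrics from step 06 are easy to spell out in plain Python; in practice `sklearn.metrics` (`accuracy_score`, `f1_score`, `matthews_corrcoef`) reproduces the full metric list, including MCC and per-class precision/recall, and is presumably what the engine wraps.

```python
# Pure-Python sketch of step 06's headline metrics (accuracy, macro F1).

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes seen."""
    classes = sorted(set(gold) | set(pred))
    f1s = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```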
07

Predicted vs. Human Class Distribution

engine confusion_distribution · depends: classification_metrics
config (2 keys)
{
  "gold": "human_label",
  "predictions": [
    "beigesage_prediction",
    "gpt4o_prediction",
    "llama_prediction",
    "phi3_prediction",
    "gemma2_prediction",
    "vader_prediction"
  ]
}
expected results (3 keys)
{
  "mixed_share_beigesage": 0.405,
  "mixed_share_human": 0.415,
  "mixed_share_non_finetuned_range": [
    0.655,
    0.785
  ]
}
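The distribution check in step 07 is a one-liner per model: compare each model's share of "mixed" predictions against the human share (0.415). Per the expected results, the non-fine-tuned models over-predict "mixed" (0.655–0.785) while BeigeSage stays close (0.405).

```python
# Sketch of step 07: predicted class share for one model's outputs.
from collections import Counter

def class_share(labels, cls="mixed"):
    return Counter(labels)[cls] / len(labels)
```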
08

Score full 1970-2023 Beige Book corpus with BeigeSage

action classify_corpus · depends: classification_metrics
config (5 keys)
{
  "aggregate_to_period": true,
  "input": "./data/beige_books_1970_2023_chunks.csv",
  "model": "beigesage",
  "output": "./data/beigesage_full_corpus.csv",
  "score_normalization": "categorical_to_signed_continuous"
}
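Step 08's normalization can be sketched as a label-to-score map followed by a per-period mean. The exact mapping is an assumption here (positive = +1, mixed = 0, negative = -1); the pipeline's `categorical_to_signed_continuous` option may weight classes differently.

```python
# Sketch of step 08: map class labels to signed scores (assumed
# +1/0/-1 mapping) and average within each Beige Book period.
from collections import defaultdict

SIGNED = {"positive": 1.0, "mixed": 0.0, "negative": -1.0}

def aggregate_to_period(rows):
    """rows: iterable of (period, label) -> {period: mean signed score}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for period, label in rows:
        sums[period] += SIGNED[label]
        counts[period] += 1
    return {p: sums[p] / counts[p] for p in sums}
```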
09

BeigeSage Sentiment vs. Macro Indicators

engine correlation_matrix · depends: score_full_corpus, compare_to_kb
config (2 keys)
{
  "data_sources": {
    "consumer_price_index": "FRED CPIAUCSL",
    "gdp_growth_rate": "Brave-Butters-Kelley monthly real GDP growth (Chicago Fed)",
    "unemployment_rate": "FRED UNRATE"
  },
  "variables": [
    "beige_book_sentiment",
    "gdp_growth_rate",
    "unemployment_rate",
    "consumer_price_index"
  ]
}
expected results (3 keys)
{
  "beige_book_sentiment↔consumer_price_index": {
    "direction": "positive",
    "n": 470,
    "p": 0.001,
    "r": 0.42
  },
  "beige_book_sentiment↔gdp_growth_rate": {
    "direction": "positive",
    "n": 470,
    "p": 0.001,
    "r": 0.29
  },
  "beige_book_sentiment↔unemployment_rate": {
    "direction": "negative",
    "n": 470,
    "p": 0.001,
    "r": -0.24
  }
}
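The statistic in step 09 is plain Pearson's r between the period-level sentiment series and each macro indicator (n = 470 periods in the expected results). Fetching the indicators would go through the FRED API and the Chicago Fed's BBK series; only the correlation itself is sketched.

```python
# Sketch of step 09: Pearson's r between two aligned series.
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```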
10

AR(3) + Sentiment Incremental Forecast

engine autoregression_with_predictor · depends: macro_correlations
config (4 keys)
{
  "dependents": [
    "gdp_growth_rate",
    "unemployment_rate",
    "consumer_price_index"
  ],
  "horizons": [
    1,
    2,
    3
  ],
  "lags": 3,
  "predictor": "beige_book_sentiment"
}
expected results (3 keys)
{
  "consumer_price_index": {
    "delta_r2_h3": 0.003
  },
  "gdp_growth_rate": {
    "baseline_r2_h1": 0.451,
    "delta_r2_h1": 0.003,
    "delta_r2_h2": 0.022,
    "delta_r2_h3": 0.032
  },
  "unemployment_rate": {
    "delta_r2_h3": 0.035
  }
}
// from pax
// engines
engine.correlation_matrix · engine.autoregression_with_predictor · engine.classification_metrics · engine.confusion_distribution
// note
Step bodies are extracted from the .pax archive at build time. Download the parent pax for the full YAML.
[ download smith-lambert-2026-beigesage.pax.tar.gz ]