BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed).
Comparative evaluation of LLMs (GPT-4o, Llama 3.1, Phi3, Gemma2), a fine-tuned RoBERTa model (BeigeSage), and a lexical baseline (VADER) for sentiment classification of Federal Reserve Beige Book reports across positive, negative, and mixed classes. Demonstrates that domain-specific fine-tuning yields the best classification performance and that BeigeSage sentiment correlates with macroeconomic indicators (GDP growth, unemployment, CPI).
Domain: Beige Book Sentiment Classification
Application of large language models, fine-tuned BERT-family models, and lexical methods to classify sentiment (positive, negative, mixed) in the Federal Reserve’s Beige Book reports, and evaluation of the resulting sentiment series against macroeconomic indicators.
Temporal scope: 1970–2023 (Beige Book corpus); evaluation sample: 2007–2023 | Population: Federal Reserve Beige Book report chunks (256-token segments)
BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed).
Llama 3.1 (8B parameters, zero-shot) achieved accuracy=0.68, F1=0.67, MCC=0.52 on the same Beige Book test set, the best among non-fine-tuned LLMs.
GPT-4o (zero-shot, OpenAI API) achieved accuracy=0.63, F1=0.62, MCC=0.48 on the Beige Book test set, underperforming both BeigeSage and Llama 3.1 8B despite being a much larger closed-source model.
Phi3 (3.8B parameters, Microsoft, zero-shot) achieved accuracy=0.62, F1=0.60, MCC=0.45, demonstrating that small open-source models can approach larger closed-source LLMs on Beige Book sentiment.
Gemma2 (9B parameters, Google, zero-shot) achieved accuracy=0.59, F1=0.55, MCC=0.41, the weakest of the LLMs tested.
VADER (lexical baseline) achieved accuracy=0.49, F1=0.43, MCC=0.27, substantially below all LLMs. VADER over-predicted positive sentiment (76% predicted positive vs. 39% human) and under-predicted mixed (8% vs. 42% human).
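VADER reduces each text to a compound score in [-1, 1]; the conventional cutoffs map scores ≥ 0.05 to positive and ≤ -0.05 to negative. How the paper routes the middle band to the "mixed" class is not stated here, so the rule below is an illustrative assumption, not the authors' exact procedure:

```python
def classify_compound(compound):
    """Map a VADER compound score in [-1, 1] to one of three classes.

    The +/-0.05 cutoffs are VADER's conventional thresholds; sending the
    middle band to "mixed" is an assumption about this study's setup.
    """
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "mixed"
```

Under a rule like this, VADER's 76% positive-prediction rate would suggest that Beige Book chunks' compound scores cluster well above the positive cutoff.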
Non-fine-tuned LLMs systematically over-predict the mixed class on Beige Book chunks: predicted mixed share ranged 65.5% (Llama) to 78.5% (Gemma2), versus a 41.5% human base rate. BeigeSage matched the human base rate at 40.5%.
BeigeSage achieved much higher recall on the positive class (82%) than non-fine-tuned LLMs (29%-47%), while maintaining the highest negative-class recall (64%). Non-fine-tuned LLMs achieved very high mixed-class recall (90%-95%) only at the cost of large recall losses on positive and negative classes.
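The evaluation metrics used throughout (accuracy, macro F1, and MCC, the latter via Gorodkin's multiclass generalization of the Matthews correlation) can be computed directly from labeled predictions. This sketch uses toy three-class labels, not the paper's 200-chunk test set:

```python
import math
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    # Unweighted mean of per-class F1 scores
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def mcc(y_true, y_pred):
    # Gorodkin's multiclass generalization of the Matthews correlation
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tc, pc = Counter(y_true), Counter(y_pred)
    num = correct * n - sum(tc[k] * pc.get(k, 0) for k in tc)
    den = math.sqrt((n * n - sum(v * v for v in pc.values())) *
                    (n * n - sum(v * v for v in tc.values())))
    return num / den if den else 0.0

# Toy labels only, for illustration
y_true = ["positive", "positive", "negative", "mixed", "mixed", "negative"]
y_pred = ["positive", "mixed", "negative", "mixed", "positive", "negative"]
labels = ["positive", "negative", "mixed"]
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred, labels), mcc(y_true, y_pred))
```

Macro F1 weights all three classes equally, which is why the non-fine-tuned LLMs' high mixed-class recall does not rescue their scores when positive and negative recall collapse.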
BeigeSage sentiment is positively correlated with monthly real GDP growth: r=0.29, p<0.001 (N=470 Beige-Book-period observations).
BeigeSage sentiment is negatively correlated with the U.S. unemployment rate: r=-0.24, p<0.001 (N=470).
BeigeSage sentiment is positively correlated with the Consumer Price Index: r=0.42, p<0.001 (N=470). Higher Beige Book sentiment co-moves with higher inflation, consistent with strong demand environments generating both optimism and price pressure.
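Correlations like the ones above are sample Pearson coefficients between the period-level sentiment series and each macro indicator. A minimal sketch with illustrative series (not the paper's N=470 observations):

```python
import math

def pearson_r(x, y):
    # Sample Pearson correlation between two equal-length series
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative values only
sentiment = [0.1, 0.4, -0.2, 0.3, -0.5, 0.6]
gdp_growth = [0.5, 1.1, -0.3, 0.8, -1.0, 1.4]
print(round(pearson_r(sentiment, gdp_growth), 3))
```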
BeigeSage sentiment provides incremental forecasting power for GDP growth at 2-3 Beige-Book-period horizons but adds little at 1 period: ΔR² = 0.003 at h=1, 0.022 at h=2, 0.032 at h=3, on top of a baseline AR(3) model whose own R² collapses from 0.451 (h=1) to 0.043 (h=2) and 0.040 (h=3). For unemployment and CPI, sentiment adds smaller increments and the AR baseline already explains most variance.
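The incremental-R² exercise compares an AR(3) baseline against the same model augmented with the sentiment series, with ΔR² = R²_augmented − R²_baseline at each horizon h. The sketch below uses synthetic data where sentiment genuinely leads the target; the series, coefficients, and horizon are illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400
sentiment = rng.normal(size=T)
# Synthetic "GDP growth": AR dynamics plus a lagged sentiment effect (illustrative)
gdp = np.zeros(T)
for t in range(3, T):
    gdp[t] = (0.3 * gdp[t - 1] + 0.1 * gdp[t - 2] + 0.05 * gdp[t - 3]
              + 0.5 * sentiment[t - 2] + rng.normal(scale=0.5))

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

h = 2                                   # forecast horizon in Beige Book periods
t = np.arange(3, T - h)
y = gdp[t + h]                          # target h periods ahead
const = np.ones(len(t))
X_base = np.column_stack([const, gdp[t], gdp[t - 1], gdp[t - 2]])  # AR(3)
X_aug = np.column_stack([X_base, sentiment[t]])                    # AR(3) + sentiment
delta_r2 = r2(X_aug, y) - r2(X_base, y)
print(round(delta_r2, 3))
```

The same mechanism explains the paper's pattern: at h=1 the AR(3) terms already absorb most predictable variation (R²=0.451), leaving little for sentiment (ΔR²=0.003), while at h=2-3 the AR baseline collapses and sentiment's lead becomes relatively more informative.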
name: smith-lambert-2026-beigesage
version: 1.0.0
pax_type: paper
author: Smith, Charlie; Lambert, Joshua
license: CC-BY-4.0
published_by: JELambert
domain: beige_book_sentiment_classification
constructs:
  - beige_book_sentiment
  - sentiment_classification_accuracy
  - sentiment_classification_f1
  - sentiment_classification_mcc
  - model_execution_speed
  - model_architecture_type
  - fine_tuning_treatment
  - mixed_sentiment_overprediction
  - gdp_growth_rate
  - unemployment_rate
  - consumer_price_index
engines:
counts:
  constructs: 11
  findings: 12
  propositions: 3
  playbooks: 2
  sources: 1