
Smith & Lambert (2026) — Comparative evaluation of LLMs (GPT-4o, Llama 3.1, Phi3, Gemma2), a fine-tuned RoBERTa model (BeigeSage), and a lexical baseline (VADER)

v1.0.0 · Beige Book Sentiment Classification

Comparative evaluation of LLMs (GPT-4o, Llama 3.1, Phi3, Gemma2), a fine-tuned RoBERTa model (BeigeSage), and a lexical baseline (VADER) for sentiment classification of Federal Reserve Beige Book reports across positive, negative, and mixed classes. Demonstrates that domain-specific fine-tuning yields the best classification performance and that BeigeSage sentiment correlates with macroeconomic indicators (GDP growth, unemployment, CPI).

constructs 11 · findings 12 · propositions 3 · sources 1 · playbooks 2
// domain
Beige Book Sentiment Classification
population: Federal Reserve Beige Book report chunks (256-token segments)
level: macro · temporal scope: 1970-2023 (Beige Book corpus); evaluation 2007-2023 sample
How do open-source vs. closed-source LLMs compare on economic-text sentiment classification?
Does domain-specific fine-tuning of a BERT-family model outperform general-purpose pre-trained LLMs on Beige Book sentiment?
How do LLMs handle mixed-sentiment passages typical of central-bank communication?
Does Beige Book sentiment capture meaningful signal about macroeconomic outcomes (GDP, unemployment, CPI)?
// top findings
12 empirical claims
F001 strong

BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed).

effect=0.71 · N=200
F002 strong

Llama 3.1 (8B parameters, zero-shot) achieved accuracy=0.68, F1=0.67, MCC=0.52 on the same Beige Book test set, the best among non-fine-tuned LLMs.

effect=0.68 · N=200
F003 strong

GPT-4o (zero-shot, OpenAI API) achieved accuracy=0.63, F1=0.62, MCC=0.48 on the Beige Book test set, underperforming both BeigeSage and Llama 3.1 8B despite being a much larger closed-source model.

effect=0.63 · N=200
// abstract

Abstract

Domain: Beige Book Sentiment Classification

Application of large language models, fine-tuned BERT-family models, and lexical methods to classify sentiment (positive, negative, mixed) in the Federal Reserve’s Beige Book reports, and evaluation of the resulting sentiment series against macroeconomic indicators.

Temporal scope: 1970-2023 (Beige Book corpus); evaluation 2007-2023 sample | Population: Federal Reserve Beige Book report chunks (256-token segments)

Key Findings

  • BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed). (positive, strong)
  • Llama 3.1 (8B parameters, zero-shot) achieved accuracy=0.68, F1=0.67, MCC=0.52 on the same Beige Book test set, the best among non-fine-tuned LLMs. (positive, strong)
  • GPT-4o (zero-shot, OpenAI API) achieved accuracy=0.63, F1=0.62, MCC=0.48 on the Beige Book test set, underperforming both BeigeSage and Llama 3.1 8B despite being a much larger closed-source model. (positive, strong)
  • Phi3 (3.8B parameters, Microsoft, zero-shot) achieved accuracy=0.62, F1=0.60, MCC=0.45, demonstrating that small open-source models can approach larger closed-source LLMs on Beige Book sentiment. (positive, moderate)
  • Gemma2 (9B parameters, Google, zero-shot) achieved accuracy=0.59, F1=0.55, MCC=0.41, the weakest of the LLMs tested. (positive, moderate)
  • VADER (lexical baseline) achieved accuracy=0.49, F1=0.43, MCC=0.27, substantially below all LLMs. VADER over-predicted positive sentiment (76% predicted positive vs. 39% human) and under-predicted mixed (8% vs. 42% human). (negative, strong)
  • Non-fine-tuned LLMs systematically over-predict the mixed class on Beige Book chunks: predicted mixed share ranged 65.5% (Llama) to 78.5% (Gemma2), versus a 41.5% human base rate. BeigeSage matched the human base rate at 40.5%. (positive, strong)
  • BeigeSage achieved much higher recall on the positive class (82%) than non-fine-tuned LLMs (29%-47%), while maintaining the highest negative-class recall (64%). Non-fine-tuned LLMs achieved very high mixed-class recall (90%-95%) only at the cost of large recall losses on positive and negative classes. (positive, strong)

…and 4 more findings

Theoretical Propositions

  • [+] Domain-specific fine-tuning of a BERT-family transformer on labelled in-domain text outperforms zero-shot prompting of larger general-purpose LLMs on three-class sentiment classification of central-bank economic narratives.
  • [+] General-purpose pre-trained LLMs systematically over-predict the mixed/neutral sentiment class when classifying multi-topic economic narratives, because zero-shot prompting biases the model toward hedging in the presence of any conflicting cues, regardless of which signals are dominant.
  • [+] Sentiment extracted from Beige Book reports by a domain-fine-tuned classifier correlates with contemporaneous and short-horizon-future U.S. macroeconomic conditions (positively with GDP growth and CPI, negatively with unemployment), and contains incremental forecasting information beyond autoregressive lags of those indicators.
// tags
sentiment-analysis llm bert roberta beige-book federal-reserve text-classification economic-forecasting fine-tuning
// registry meta
domain: Beige Book Sentiment Classification
level: macro
population: Federal Reserve Beige Book report chunks (256-token segments)
pax type: paper
version: 1.0.0
published by: JELambert
archive: 18.3 KB
// research questions
  • How do open-source vs. closed-source LLMs compare on economic-text sentiment classification?
  • Does domain-specific fine-tuning of a BERT-family model outperform general-purpose pre-trained LLMs on Beige Book sentiment?
  • How do LLMs handle mixed-sentiment passages typical of central-bank communication?
  • Does Beige Book sentiment capture meaningful signal about macroeconomic outcomes (GDP, unemployment, CPI)?
// constructs.yaml
11 variables in the pax vocabulary
Each construct names a thing the field measures, with a kind and an authoritative definition.
C beige_book_sentiment
outcome
Beige Book Sentiment (Three-Class)
Categorical sentiment label assigned to a 256-token chunk of a Federal Reserve Beige Book report, taking values in {negative, mixed, positive}. Derived from human annotations on a -1 to 1 scale, with [-1, -0.2) = negative, [-0.2, 0.2] = mixed, (0.2, 1] = positive.
aliases: BB sentiment, Beige Book tone
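The threshold rule in this definition is mechanical, so it is easy to express directly; a minimal Python sketch (function name hypothetical, thresholds taken from the definition above):

```python
def bin_sentiment(score: float) -> str:
    """Map a human annotation score in [-1, 1] to the three-class label.

    Thresholds follow the construct definition:
    [-1, -0.2) -> negative, [-0.2, 0.2] -> mixed, (0.2, 1] -> positive.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score < -0.2:
        return "negative"
    if score <= 0.2:
        return "mixed"
    return "positive"
```

Note the boundary convention: -0.2 and 0.2 both fall in the mixed class, matching the closed interval [-0.2, 0.2].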
C sentiment_classification_accuracy
quantifiable
Classification Accuracy
Proportion of test-set observations whose predicted sentiment label matches the human-labelled ground truth.
aliases: acc
C sentiment_classification_f1
quantifiable
Macro F1 Score
Macro-averaged F1 score across the three sentiment classes, weighting each class equally regardless of base-rate prevalence.
aliases: F1, macro-F1
C sentiment_classification_mcc
quantifiable
Matthews Correlation Coefficient
Correlation-style classification metric incorporating all four cells of the confusion matrix; robust to class imbalance.
aliases: MCC
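The three headline metrics used across the findings can be computed straight from label lists; a self-contained sketch (helper names hypothetical), with multiclass MCC following the K-category formulation of Gorodkin (2004) that libraries such as scikit-learn also implement:

```python
import math
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions matching the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """F1 per class, averaged with equal class weights."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def multiclass_mcc(y_true, y_pred):
    """K-category MCC (Gorodkin 2004) computed from label lists."""
    s = len(y_true)                                   # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_cnt, p_cnt = Counter(y_true), Counter(y_pred)
    classes = set(y_true) | set(y_pred)
    cov = c * s - sum(t_cnt[k] * p_cnt[k] for k in classes)
    denom = math.sqrt((s * s - sum(p_cnt[k] ** 2 for k in classes)) *
                      (s * s - sum(t_cnt[k] ** 2 for k in classes)))
    return cov / denom if denom else 0.0
```

Macro averaging weights each class equally regardless of prevalence, which is why it is the fairer summary here given the 41.5% mixed base rate.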
C model_execution_speed
quantifiable
Model Execution Speed
Wall-clock time required for a model to classify the 200-chunk test set, measured on a Windows 11 laptop with 32 GB RAM and a 3.3 GHz CPU (no GPU), or on a remote server in the case of GPT-4o.
aliases: inference time
C model_architecture_type
concept
Model Architecture Type
Categorical label distinguishing the architectural family of a sentiment classifier: lexical/dictionary-based, pre-trained transformer LLM, or fine-tuned BERT-family transformer.
aliases: model class
C fine_tuning_treatment
concept
Domain-Specific Fine-Tuning
Binary indicator of whether a model received additional pre-training and supervised fine-tuning on domain-specific (Beige Book) text and labels, beyond general pre-training.
aliases: fine-tuning, FT
C mixed_sentiment_overprediction
quantifiable
Mixed-Class Overprediction Rate
Proportion of test-set chunks a model predicts as mixed minus the human-labelled base rate of mixed (41.5%). Positive values indicate overprediction of the mixed class.
aliases: mixed bias
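This construct is a simple difference of shares; a minimal sketch (function name hypothetical; the 41.5% default is the human mixed-class share on the 200-chunk test set):

```python
def mixed_overprediction_rate(predicted_labels, human_base_rate=0.415):
    """Predicted share of the 'mixed' class minus the human base rate.

    Positive values indicate the model over-predicts 'mixed'.
    """
    mixed_share = predicted_labels.count("mixed") / len(predicted_labels)
    return mixed_share - human_base_rate
```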
C gdp_growth_rate
quantifiable
Real GDP Growth Rate (Monthly)
Monthly estimate of the real GDP growth rate from the Brave-Butters-Kelley index, used because official GDP is published only quarterly.
aliases: GDP growth, BBK GDP
C unemployment_rate
quantifiable
U.S. Unemployment Rate
Standard U.S. headline unemployment rate (BLS U-3) over the Beige Book sample period, used as a macroeconomic outcome variable.
aliases: U-3, unemployment
C consumer_price_index
quantifiable
Consumer Price Index
U.S. Consumer Price Index measuring aggregate price-level changes; used as an inflation outcome variable.
aliases: CPI, inflation
// findings.yaml
12 empirical claims
Each finding cites a source and reports effect size, standard error, p-value, and sample size where available.
F001 strong

BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed).

effect 0.71 N 200 unit other
// method: Hold-out test of 200 human-labelled Beige Book chunks; macro F1 and multi-class MCC.
// model: Fine-tuned RoBERTa: further pre-training on 1970-2023 Beige Book corpus (~6.8M words), then supervised fine-tuning on 800 labelled chunks for 3 epochs, batch size 4.
F006 strong

VADER (lexical baseline) achieved accuracy=0.49, F1=0.43, MCC=0.27, substantially below all LLMs. VADER over-predicted positive sentiment (76% predicted positive vs. 39% human) and under-predicted mixed (8% vs. 42% human).

effect 0.49 N 200 unit other
// method: Lexical compound score binned to three classes; hold-out test of 200 chunks.
// model: VADER (Hutto & Gilbert 2014) compound score with -0.2/0.2 bin thresholds.
F007 strong

Non-fine-tuned LLMs systematically over-predict the mixed class on Beige Book chunks: predicted mixed share ranged 65.5% (Llama) to 78.5% (Gemma2), versus a 41.5% human base rate. BeigeSage matched the human base rate at 40.5%.

effect 0.27 N 200 unit other
// method: Class-distribution comparison between predicted labels and human labels on 200-chunk test set.
// model: Frequency comparison of predicted vs. human three-class label distributions.
F008 strong

BeigeSage achieved much higher recall on the positive class (82%) than non-fine-tuned LLMs (29%-47%), while maintaining the highest negative-class recall (64%). Non-fine-tuned LLMs achieved very high mixed-class recall (90%-95%) only at the cost of large recall losses on positive and negative classes.

effect 0.82 N 200 unit other
// method: Per-class precision/recall/F1 on 200-chunk test set.
// model: Per-class recall computed on confusion matrix.
F009 strong

BeigeSage sentiment is positively correlated with monthly real GDP growth: r=0.29, p<0.001 (N=470 Beige-Book-period observations).

correlation_r 0.29 p 0.001 N 470 unit other
// method: Pearson correlation between normalized BeigeSage sentiment (-1 to 1) and Brave-Butters-Kelley monthly GDP growth-rate estimate.
// model: Pearson r on contemporaneous BeigeSage sentiment and BBK GDP growth.
F010 strong

BeigeSage sentiment is negatively correlated with the U.S. unemployment rate: r=-0.24, p<0.001 (N=470).

correlation_r -0.24 p 0.001 N 470 unit other
// method: Pearson correlation, contemporaneous.
// model: Pearson r on contemporaneous BeigeSage sentiment and BLS U-3.
F011 strong

BeigeSage sentiment is positively correlated with the Consumer Price Index: r=0.42, p<0.001 (N=470). Higher Beige Book sentiment co-moves with higher inflation, consistent with strong demand environments generating both optimism and price pressure.

correlation_r 0.42 p 0.001 N 470 unit other
// method: Pearson correlation, contemporaneous.
// model: Pearson r on contemporaneous BeigeSage sentiment and BLS CPI.
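The contemporaneous correlations in F009-F011 are plain Pearson r; a minimal pure-Python sketch (function name hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```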
F012 moderate

BeigeSage sentiment provides incremental forecasting power for GDP growth at 2-3 Beige-Book-period horizons but adds little at 1 period: ΔR² = 0.003 at h=1, 0.022 at h=2, 0.032 at h=3, on top of a baseline AR(3) model whose own R² collapses from 0.451 (h=1) to 0.043 (h=2) and 0.040 (h=3). For unemployment and CPI, sentiment adds smaller increments and the AR baseline already explains most variance.

effect 0.032 N 470 0.072 unit other
// method: Autoregressive forecasting models with three Beige-Book-period lags of the dependent variable plus contemporaneous BeigeSage sentiment as predictor.
// model: y_t = alpha + sum_{l=1..3} beta_l * y_{t-l} + gamma * sentiment_t + e_t, estimated separately for h=1, 2, 3 forecast horizons.
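The ΔR² comparison above amounts to two OLS fits whose in-sample R² are differenced: the AR(3) baseline versus the same model with contemporaneous sentiment added. A NumPy-based sketch of that stated specification (all names hypothetical; this is not the authors' estimation code):

```python
import numpy as np

def ols_r2(X, y):
    """In-sample R^2 of an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

def delta_r2(y, sentiment, lags=3):
    """R^2 gain from adding contemporaneous sentiment to an AR(lags) model."""
    Y = y[lags:]
    # Lagged dependent variable: columns y_{t-1}, ..., y_{t-lags}.
    X_ar = np.column_stack([y[lags - l:len(y) - l] for l in range(1, lags + 1)])
    r2_base = ols_r2(X_ar, Y)
    r2_full = ols_r2(np.column_stack([X_ar, sentiment[lags:]]), Y)
    return r2_full - r2_base
```

Since the sentiment regressor is nested on top of the AR terms, in-sample ΔR² is non-negative by construction; the paper's small h=1 increment (0.003) reflects the AR baseline already explaining most short-horizon variance.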
// propositions.yaml
3 theoretical claims
Propositions are the field's reusable rules of thumb — they span findings without being tied to a single study.
P001
"Domain-specific fine-tuning of a BERT-family transformer on labelled in-domain text outperforms zero-shot prompting of larger general-purpose LLMs on three-class sentiment classification of central-bank economic narratives."
P002
"General-purpose pre-trained LLMs systematically over-predict the mixed/neutral sentiment class when classifying multi-topic economic narratives, because zero-shot prompting biases the model toward hedging in the presence of any conflicting cues, regardless of which signals are dominant."
P003
"Sentiment extracted from Beige Book reports by a domain-fine-tuned classifier correlates with contemporaneous and short-horizon-future U.S. macroeconomic conditions (positively with GDP growth and CPI, negatively with unemployment), and contains incremental forecasting information beyond autoregressive lags of those indicators."
// sources.yaml
1 citation
The evidentiary backing — papers, datasets, reports — every finding can be traced to one of these.
S001
Smith, Charlie; Lambert, Joshua (2026). BeigeSage: sentiment classification in economic texts utilizing LLMs and fine-tuned BERT models. Applied Economics.
quasi_experimental
N = 200
// playbooks/
2 analytical recipes
Step-by-step recipes that wire constructs to engines. An MCP-aware agent runs them end-to-end.
B Quick Start — Evaluate Sentiment Classifiers on Beige Books
10 steps · 5-30 minutes (depending on hardware and remote API latency)
Reproduce the headline comparison from Smith & Lambert (2026): score the 200-chunk human-labelled test set with each model, compute accuracy / macro F1 / MCC, then check correlations against GDP growth, unemployment, and CPI. Assumes you have already trained or downloaded BeigeSage and have API access / local weights for the comparison LLMs. For end-to-end training-from-scratch, see the `reproduce_beigesage` playbook.
engine.correlation_matrix · engine.autoregression_with_predictor · engine.classification_metrics · engine.confusion_distribution
B Reproduce BeigeSage from Scratch (Train + Evaluate)
8 steps · ~6 hours on Colab T4 (39 min MLM pre-training + 5h14m fine-tuning + 1.6 min inference)
End-to-end reproduction of the BeigeSage model: pull the base RoBERTa weights from HuggingFace, MLM-pretrain on the 1970-2023 Beige Book corpus, run supervised fine-tuning on the 800-chunk labelled training set, and evaluate on the 200-chunk held-out test set. Recommended environment: Google Colab with a T4 or better GPU (matches the authors' setup).
engine.classification_metrics
// playbook step bodies live in the .pax archive; download to inspect.
// relationships.yaml
7 construct edges
The pax's causal graph — which constructs are claimed to drive which others, and how strongly.
from · to · kind · direction · strength
model_architecture_type sentiment_classification_f1 correlational conditional moderate
beige_book_sentiment →+ gdp_growth_rate correlational positive moderate
beige_book_sentiment →− unemployment_rate correlational negative moderate
beige_book_sentiment →+ consumer_price_index correlational positive moderate
fine_tuning_treatment model_execution_speed compositional conditional moderate
// pax.yaml manifest
name: smith-lambert-2026-beigesage
version: 1.0.0
pax_type: paper
author: Smith, Charlie; Lambert, Joshua
license: CC-BY-4.0
published_by: JELambert
domain: beige_book_sentiment_classification
constructs:
  - beige_book_sentiment
  - sentiment_classification_accuracy
  - sentiment_classification_f1
  - sentiment_classification_mcc
  - model_execution_speed
  - model_architecture_type
  - fine_tuning_treatment
  - mixed_sentiment_overprediction
  - gdp_growth_rate
  - unemployment_rate
  - consumer_price_index
engines:
counts:
  constructs: 11
  findings: 12
  propositions: 3
  playbooks: 2
  sources: 1