
Smith & Lambert (2026) — Comparative evaluation of LLMs (GPT-4o, Llama 3.1, Phi3, Gemma2), a fine-tuned RoBERTa model (BeigeSage), and a lexical baseline (VADER)

v1.0.0 · Beige Book Sentiment Classification

Comparative evaluation of LLMs (GPT-4o, Llama 3.1, Phi3, Gemma2), a fine-tuned RoBERTa model (BeigeSage), and a lexical baseline (VADER) for sentiment classification of Federal Reserve Beige Book reports across positive, negative, and mixed classes. Demonstrates that domain-specific fine-tuning yields the best classification performance and that BeigeSage sentiment correlates with macroeconomic indicators (GDP growth, unemployment, CPI).

constructs 11 · findings 12 · propositions 3 · sources 1 · playbooks 2
// domain
Beige Book Sentiment Classification
population: Federal Reserve Beige Book report chunks (256-token segments)
level: macro · temporal scope: 1970-2023 (Beige Book corpus); evaluation 2007-2023 sample
How do open-source vs. closed-source LLMs compare on economic-text sentiment classification?
Does domain-specific fine-tuning of a BERT-family model outperform general-purpose pre-trained LLMs on Beige Book sentiment?
How do LLMs handle mixed-sentiment passages typical of central-bank communication?
Does Beige Book sentiment capture meaningful signal about macroeconomic outcomes (GDP, unemployment, CPI)?
// top findings
12 empirical claims
F001 strong

BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed).

effect=0.71 · N=200
F002 strong

Llama 3.1 (8B parameters, zero-shot) achieved accuracy=0.68, F1=0.67, MCC=0.52 on the same Beige Book test set, the best among non-fine-tuned LLMs.

effect=0.68 · N=200
F003 strong

GPT-4o (zero-shot, OpenAI API) achieved accuracy=0.63, F1=0.62, MCC=0.48 on the Beige Book test set, underperforming both BeigeSage and Llama 3.1 8B despite being a much larger closed-source model.

effect=0.63 · N=200
// abstract

Abstract

Domain: Beige Book Sentiment Classification

Application of large language models, fine-tuned BERT-family models, and lexical methods to classify sentiment (positive, negative, mixed) in the Federal Reserve’s Beige Book reports, and evaluation of the resulting sentiment series against macroeconomic indicators.

Temporal scope: 1970-2023 (Beige Book corpus); evaluation 2007-2023 sample | Population: Federal Reserve Beige Book report chunks (256-token segments)

Key Findings

  • BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed). (positive, strong)
  • Llama 3.1 (8B parameters, zero-shot) achieved accuracy=0.68, F1=0.67, MCC=0.52 on the same Beige Book test set, the best among non-fine-tuned LLMs. (positive, strong)
  • GPT-4o (zero-shot, OpenAI API) achieved accuracy=0.63, F1=0.62, MCC=0.48 on the Beige Book test set, underperforming both BeigeSage and Llama 3.1 8B despite being a much larger closed-source model. (positive, strong)
  • Phi3 (3.8B parameters, Microsoft, zero-shot) achieved accuracy=0.62, F1=0.60, MCC=0.45, demonstrating that small open-source models can approach larger closed-source LLMs on Beige Book sentiment. (positive, moderate)
  • Gemma2 (9B parameters, Google, zero-shot) achieved accuracy=0.59, F1=0.55, MCC=0.41, the weakest of the LLMs tested. (positive, moderate)
  • VADER (lexical baseline) achieved accuracy=0.49, F1=0.43, MCC=0.27, substantially below all LLMs. VADER over-predicted positive sentiment (76% predicted positive vs. 39% human) and under-predicted mixed (8% vs. 42% human). (negative, strong)
  • Non-fine-tuned LLMs systematically over-predict the mixed class on Beige Book chunks: predicted mixed share ranged 65.5% (Llama) to 78.5% (Gemma2), versus a 41.5% human base rate. BeigeSage matched the human base rate at 40.5%. (positive, strong)
  • BeigeSage achieved much higher recall on the positive class (82%) than non-fine-tuned LLMs (29%-47%), while maintaining the highest negative-class recall (64%). Non-fine-tuned LLMs achieved very high mixed-class recall (90%-95%) only at the cost of large recall losses on positive and negative classes. (positive, strong)

…and 4 more findings

Theoretical Propositions

  • [+] Domain-specific fine-tuning of a BERT-family transformer on labelled in-domain text outperforms zero-shot prompting of larger general-purpose LLMs on three-class sentiment classification of central-bank economic narratives.
  • [+] General-purpose pre-trained LLMs systematically over-predict the mixed/neutral sentiment class when classifying multi-topic economic narratives, because zero-shot prompting biases the model toward hedging in the presence of any conflicting cues, regardless of which signals are dominant.
  • [+] Sentiment extracted from Beige Book reports by a domain-fine-tuned classifier correlates with contemporaneous and short-horizon-future U.S. macroeconomic conditions (positively with GDP growth and CPI, negatively with unemployment), and contains incremental forecasting information beyond autoregressive lags of those indicators.
// tags
sentiment-analysis llm bert roberta beige-book federal-reserve text-classification economic-forecasting fine-tuning
// registry meta
domain: Beige Book Sentiment Classification
level: macro
population: Federal Reserve Beige Book report chunks (256-token segments)
pax type: paper
version: 1.0.0
published by: JELambert
archive: 18.3 KB
// research questions
  • How do open-source vs. closed-source LLMs compare on economic-text sentiment classification?
  • Does domain-specific fine-tuning of a BERT-family model outperform general-purpose pre-trained LLMs on Beige Book sentiment?
  • How do LLMs handle mixed-sentiment passages typical of central-bank communication?
  • Does Beige Book sentiment capture meaningful signal about macroeconomic outcomes (GDP, unemployment, CPI)?
// constructs.yaml
11 variables in the pax vocabulary
Each construct names a thing the field measures, with a kind and an authoritative definition.
C beige_book_sentiment
outcome
Beige Book Sentiment (Three-Class)
Categorical sentiment label assigned to a 256-token chunk of a Federal Reserve Beige Book report, taking values in {negative, mixed, positive}. Derived from human annotations on a -1 to 1 scale, with [-1, -0.2) = negative, [-0.2, 0.2] = mixed, (0.2, 1] = positive.
aliases: BB sentiment, Beige Book tone
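The threshold rule in this definition is mechanical, so it is easy to express directly; a minimal Python sketch (function name hypothetical, thresholds taken from the definition above):

```python
def bin_sentiment(score: float) -> str:
    """Map a human annotation score in [-1, 1] to the three-class label.

    Thresholds follow the construct definition:
    [-1, -0.2) -> negative, [-0.2, 0.2] -> mixed, (0.2, 1] -> positive.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie in [-1, 1]")
    if score < -0.2:
        return "negative"
    if score <= 0.2:
        return "mixed"
    return "positive"
```

Note the boundary convention: -0.2 and 0.2 both fall in the mixed class, matching the closed interval [-0.2, 0.2].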
C sentiment_classification_accuracy
quantifiable
Classification Accuracy
Proportion of test-set observations whose predicted sentiment label matches the human-labelled ground truth.
aliases: acc
C sentiment_classification_f1
quantifiable
Macro F1 Score
Macro-averaged F1 score across the three sentiment classes, weighting each class equally regardless of base-rate prevalence.
aliases: F1, macro-F1
C sentiment_classification_mcc
quantifiable
Matthews Correlation Coefficient
Correlation-style classification metric incorporating all four cells of the confusion matrix; robust to class imbalance.
aliases: MCC
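The three headline metrics used across the findings can be computed straight from label lists; a self-contained sketch (helper names hypothetical), with multiclass MCC following the K-category formulation of Gorodkin (2004) that libraries such as scikit-learn also implement:

```python
import math
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions matching the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """F1 per class, averaged with equal class weights."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def multiclass_mcc(y_true, y_pred):
    """K-category MCC (Gorodkin 2004) computed from label lists."""
    s = len(y_true)                                   # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_cnt, p_cnt = Counter(y_true), Counter(y_pred)
    classes = set(y_true) | set(y_pred)
    cov = c * s - sum(t_cnt[k] * p_cnt[k] for k in classes)
    denom = math.sqrt((s * s - sum(p_cnt[k] ** 2 for k in classes)) *
                      (s * s - sum(t_cnt[k] ** 2 for k in classes)))
    return cov / denom if denom else 0.0
```

Macro averaging weights each class equally regardless of prevalence, which is why it is the fairer summary here given the 41.5% mixed base rate.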
C model_execution_speed
quantifiable
Model Execution Speed
Wall-clock time required for a model to classify the 200-chunk test set, measured on a Windows 11 laptop with 32 GB RAM and a 3.3 GHz CPU (no GPU), or on a remote server in the case of GPT-4o.
aliases: inference time
C model_architecture_type
concept
Model Architecture Type
Categorical label distinguishing the architectural family of a sentiment classifier: lexical/dictionary-based, pre-trained transformer LLM, or fine-tuned BERT-family transformer.
aliases: model class
C fine_tuning_treatment
concept
Domain-Specific Fine-Tuning
Binary indicator of whether a model received additional pre-training and supervised fine-tuning on domain-specific (Beige Book) text and labels, beyond general pre-training.
aliases: fine-tuning, FT
C mixed_sentiment_overprediction
quantifiable
Mixed-Class Overprediction Rate
Proportion of test-set chunks a model predicts as mixed minus the human-labelled base rate of mixed (41.5%). Positive values indicate overprediction of the mixed class.
aliases: mixed bias
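This construct is a simple difference of shares; a minimal sketch (function name hypothetical; the 41.5% default is the human mixed-class share on the 200-chunk test set):

```python
def mixed_overprediction_rate(predicted_labels, human_base_rate=0.415):
    """Predicted share of the 'mixed' class minus the human base rate.

    Positive values indicate the model over-predicts 'mixed'.
    """
    mixed_share = predicted_labels.count("mixed") / len(predicted_labels)
    return mixed_share - human_base_rate
```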
C gdp_growth_rate
quantifiable
Real GDP Growth Rate (Monthly)
Monthly estimate of the real GDP growth rate from the Brave-Butters-Kelley index, used because official GDP is published only quarterly.
aliases: GDP growth, BBK GDP
C unemployment_rate
quantifiable
U.S. Unemployment Rate
Standard U.S. headline unemployment rate (BLS U-3) over the Beige Book sample period, used as a macroeconomic outcome variable.
aliases: U-3, unemployment
C consumer_price_index
quantifiable
Consumer Price Index
U.S. Consumer Price Index measuring aggregate price-level changes; used as an inflation outcome variable.
aliases: CPI, inflation
// findings.yaml
12 empirical claims
Each finding cites a source and reports effect size, standard error, p-value, and sample size where available.
F001 strong

BeigeSage (fine-tuned RoBERTa) achieved the highest overall classification performance on Beige Book sentiment: accuracy=0.71, macro F1=0.71, MCC=0.55 on a 200-chunk held-out test set across three classes (positive, negative, mixed).

effect 0.71 N 200 unit other
// method: Hold-out test of 200 human-labelled Beige Book chunks; macro F1 and multi-class MCC.
// model: Fine-tuned RoBERTa: further pre-training on 1970-2023 Beige Book corpus (~6.8M words), then supervised fine-tuning on 800 labelled chunks for 3 epochs, batch size 4.
F006 strong

VADER (lexical baseline) achieved accuracy=0.49, F1=0.43, MCC=0.27, substantially below all LLMs. VADER over-predicted positive sentiment (76% predicted positive vs. 39% human) and under-predicted mixed (8% vs. 42% human).

effect 0.49 N 200 unit other
// method: Lexical compound score binned to three classes; hold-out test of 200 chunks.
// model: VADER (Hutto & Gilbert 2014) compound score with -0.2/0.2 bin thresholds.
F007 strong

Non-fine-tuned LLMs systematically over-predict the mixed class on Beige Book chunks: predicted mixed share ranged 65.5% (Llama) to 78.5% (Gemma2), versus a 41.5% human base rate. BeigeSage matched the human base rate at 40.5%.

effect 0.27 N 200 unit other
// method: Class-distribution comparison between predicted labels and human labels on 200-chunk test set.
// model: Frequency comparison of predicted vs. human three-class label distributions.
F008 strong

BeigeSage achieved much higher recall on the positive class (82%) than non-fine-tuned LLMs (29%-47%), while maintaining the highest negative-class recall (64%). Non-fine-tuned LLMs achieved very high mixed-class recall (90%-95%) only at the cost of large recall losses on positive and negative classes.

effect 0.82 N 200 unit other
// method: Per-class precision/recall/F1 on 200-chunk test set.
// model: Per-class recall computed on confusion matrix.
F009 strong

BeigeSage sentiment is positively correlated with monthly real GDP growth: r=0.29, p<0.001 (N=470 Beige-Book-period observations).

correlation_r 0.29 p 0.001 N 470 unit other
// method: Pearson correlation between normalized BeigeSage sentiment (-1 to 1) and Brave-Butters-Kelley monthly GDP growth-rate estimate.
// model: Pearson r on contemporaneous BeigeSage sentiment and BBK GDP growth.
F010 strong

BeigeSage sentiment is negatively correlated with the U.S. unemployment rate: r=-0.24, p<0.001 (N=470).

correlation_r -0.24 p 0.001 N 470 unit other
// method: Pearson correlation, contemporaneous.
// model: Pearson r on contemporaneous BeigeSage sentiment and BLS U-3.
F011 strong

BeigeSage sentiment is positively correlated with the Consumer Price Index: r=0.42, p<0.001 (N=470). Higher Beige Book sentiment co-moves with higher inflation, consistent with strong demand environments generating both optimism and price pressure.

correlation_r 0.42 p 0.001 N 470 unit other
// method: Pearson correlation, contemporaneous.
// model: Pearson r on contemporaneous BeigeSage sentiment and BLS CPI.
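The contemporaneous correlations in F009-F011 are plain Pearson r; a minimal pure-Python sketch (function name hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```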
F012 moderate

BeigeSage sentiment provides incremental forecasting power for GDP growth at 2-3 Beige-Book-period horizons but adds little at 1 period: ΔR² = 0.003 at h=1, 0.022 at h=2, 0.032 at h=3, on top of a baseline AR(3) model whose own R² collapses from 0.451 (h=1) to 0.043 (h=2) and 0.040 (h=3). For unemployment and CPI, sentiment adds smaller increments and the AR baseline already explains most variance.

effect 0.032 N 470 0.072 unit other
// method: Autoregressive forecasting models with three Beige-Book-period lags of the dependent variable plus contemporaneous BeigeSage sentiment as predictor.
// model: y_t = alpha + sum_{l=1..3} beta_l * y_{t-l} + gamma * sentiment_t + e_t, estimated separately for h=1, 2, 3 forecast horizons.
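The ΔR² comparison above amounts to two OLS fits whose in-sample R² are differenced: the AR(3) baseline versus the same model with contemporaneous sentiment added. A NumPy-based sketch of that stated specification (all names hypothetical; this is not the authors' estimation code):

```python
import numpy as np

def ols_r2(X, y):
    """In-sample R^2 of an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss

def delta_r2(y, sentiment, lags=3):
    """R^2 gain from adding contemporaneous sentiment to an AR(lags) model."""
    Y = y[lags:]
    # Lagged dependent variable: columns y_{t-1}, ..., y_{t-lags}.
    X_ar = np.column_stack([y[lags - l:len(y) - l] for l in range(1, lags + 1)])
    r2_base = ols_r2(X_ar, Y)
    r2_full = ols_r2(np.column_stack([X_ar, sentiment[lags:]]), Y)
    return r2_full - r2_base
```

Since the sentiment regressor is nested on top of the AR terms, in-sample ΔR² is non-negative by construction; the paper's small h=1 increment (0.003) reflects the AR baseline already explaining most short-horizon variance.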
// propositions.yaml
3 theoretical claims
Propositions are the field's reusable rules of thumb — they span findings without being tied to a single study.
P001
"Domain-specific fine-tuning of a BERT-family transformer on labelled in-domain text outperforms zero-shot prompting of larger general-purpose LLMs on three-class sentiment classification of central-bank economic narratives."
P002
"General-purpose pre-trained LLMs systematically over-predict the mixed/neutral sentiment class when classifying multi-topic economic narratives, because zero-shot prompting biases the model toward hedging in the presence of any conflicting cues, regardless of which signals are dominant."
P003
"Sentiment extracted from Beige Book reports by a domain-fine-tuned classifier correlates with contemporaneous and short-horizon-future U.S. macroeconomic conditions (positively with GDP growth and CPI, negatively with unemployment), and contains incremental forecasting information beyond autoregressive lags of those indicators."
// sources.yaml
1 citation
The evidentiary backing — papers, datasets, reports — every finding can be traced to one of these.
S001
Smith, Charlie; Lambert, Joshua (2026). BeigeSage: sentiment classification in economic texts utilizing LLMs and fine-tuned BERT models. Applied Economics.
quasi_experimental
N = 200
// playbooks/
2 analytical recipes
Step-by-step recipes that wire constructs to engines. An MCP-aware agent runs them end-to-end.
B Quick Start — Evaluate Sentiment Classifiers on Beige Books
10 steps · 5-30 minutes (depending on hardware and remote API latency)
Reproduce the headline comparison from Smith & Lambert (2026): score the 200-chunk human-labelled test set with each model, compute accuracy / macro F1 / MCC, then check correlations against GDP growth, unemployment, and CPI. Assumes you have already trained or downloaded BeigeSage and have API access / local weights for the comparison LLMs. For end-to-end training-from-scratch, see the `reproduce_beigesage` playbook.
engine.correlation_matrix · engine.autoregression_with_predictor · engine.classification_metrics · engine.confusion_distribution
B Reproduce BeigeSage from Scratch (Train + Evaluate)
8 steps · ~6 hours on Colab T4 (39 min MLM pre-training + 5h14m fine-tuning + 1.6 min inference)
End-to-end reproduction of the BeigeSage model: pull the base RoBERTa weights from HuggingFace, MLM-pretrain on the 1970-2023 Beige Book corpus, run supervised fine-tuning on the 800-chunk labelled training set, and evaluate on the 200-chunk held-out test set. Recommended environment: Google Colab with a T4 or better GPU (matches the authors' setup).
engine.classification_metrics
// playbook step bodies live in the .pax archive; download to inspect.
// relationships.yaml
7 construct edges
The pax's causal graph — which constructs are claimed to drive which others, and how strongly.
from · to · kind · direction · strength
model_architecture_type sentiment_classification_f1 correlational conditional moderate
beige_book_sentiment →+ gdp_growth_rate correlational positive moderate
beige_book_sentiment →− unemployment_rate correlational negative moderate
beige_book_sentiment →+ consumer_price_index correlational positive moderate
fine_tuning_treatment model_execution_speed compositional conditional moderate
// pax.yaml manifest
name: smith-lambert-2026-beigesage
version: 1.0.0
pax_type: paper
author: Smith, Charlie; Lambert, Joshua
license: CC-BY-4.0
published_by: JELambert
domain: beige_book_sentiment_classification
constructs:
  - beige_book_sentiment
  - sentiment_classification_accuracy
  - sentiment_classification_f1
  - sentiment_classification_mcc
  - model_execution_speed
  - model_architecture_type
  - fine_tuning_treatment
  - mixed_sentiment_overprediction
  - gdp_growth_rate
  - unemployment_rate
  - consumer_price_index
engines:
counts:
  constructs: 11
  findings: 12
  propositions: 3
  playbooks: 2
  sources: 1