Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve You Best?

Benchmarking System: Three Moving Parts. Four boxes: 1. Story Corpus; 2. Prompts; 3. Ground Truth Dataset; 4 Models and Settings. Flexibility: parts 1, 2, and 4 can be changed independently to create custom snapshots of accuracy scores. Part 3 (ground truth) is a reflection of Part 1.

We benchmarked 13 leading LLMs—from Anthropic, OpenAI, Google, Meta, and Deepseek—on their ability to identify and categorize 646 source attributions from 43 professionally published news articles. Using a ground truth dataset built by trained student annotators, we measured accuracy across five sourcing elements: sourced statements, type of source, name of source, title, and source justification. Only two models reached an 80% accuracy threshold for identifying sourced statements. All models performed well on structured elements like source type, name, and title. Every model struggled with source justification annotation—the element we assert is critical to ethics-based applications—with the best score reaching only 56%. The complete dataset, prompts, and scoring code are publicly available.

Mar 13, 2026

Model Performance Summary

This table displays accuracy scores for each model across all annotation elements. Green highlighting (80%+) indicates strong performance, yellow (60-80%) shows moderate capability, and red (<60%) signals areas requiring improvement.

Model	Sourced Statement	Type of Source	Name of Source	Title of Source	Source Justification
Gemini 2.5 Pro	83.66%	88.60%	79.94%	79.81%	44.79%
Claude 3.7 Sonnet-Thinking	80.48%	90.01%	84.79%	84.57%	49.00%
Claude 3.7 Sonnet	80.01%	88.25%	82.67%	82.61%	51.95%
GPT 4.1	78.49%	91.07%	86.61%	85.61%	38.91%
Claude 4 Sonnet	76.76%	91.43%	85.47%	86.03%	56.17%
Gemini 1.5 Pro	76.25%	91.89%	89.81%	89.48%	49.79%
GPT 5	73.72%	90.49%	85.25%	86.66%	41.46%
Deepseek Chat-V3-0324	73.37%	88.83%	84.83%	85.24%	43.99%
ChatGPT 4o-latest	72.72%	93.45%	87.69%	87.53%	41.56%
Deepseek R1-0528	67.46%	92.95%	84.81%	83.97%	40.09%
Llama 4 Maverick	66.59%	90.59%	86.16%	86.09%	27.86%
Llama 3.1-405B	64.90%	90.55%	85.32%	85.26%	37.35%
Claude 3.5 Sonnet	61.82%	92.49%	87.93%	87.34%	49.33%

What you need to know

All 13 models performed well (80%+ accuracy) on detecting the structured elements—source type (88–93%), name of the source (80–90%), and title (80–89%). Title-based source analysis of news articles is an easier starting point for applications.
Only 2 of 13 models meet a practical accuracy bar (80%) for sourced attributions enumeration in news stories. Gemini 2.5 Pro (83.66%) and Claude 3.7 Sonnet (80.01%). This is a foundational task all other sourcing analysis depends on.
Source justification remains out of reach for every model. The highest score for identifying why sources are included—the annotation element critical to ethics-based news applications and source diversity audits—was 56%.
Analysis applications on structured sourcing elements are feasible. All models reliably extract source type, name, and title (80–93% accuracy), enabling applications such as expert-vs-non-expert source proportions, organizational representation analysis, and source authority patterns. For e.g. news reporters can use the prompts and a high-scoring model to generate basic sourcing audit files of their own stories.
The dataset and code are fully open. 646 ground truth annotations across 43 news articles, all LLM outputs, prompts, and scoring code are publicly available on Hugging Face and GitHub.

Download a PDF of the LLM Performance Benchmark Report

Access the Dataset

The complete benchmark dataset is publicly available on Hugging Face. The project code (annotation extraction) and dataset is available on GitHub. Everything needed to reproduce, extend, or build on this benchmark is included.

Hugging Face LLM repository

Dataset License: Apache 2.0

Folder Structure

news_articles_corpus/ — 43 news article text files, one per story.
students_annotations_GTs/ — 646 rows of ground truth sourcing annotations (CSV), plus inter-coder reliability scoring code and data.
Llm_generated_annotations/ — LLM-generated annotation files for all 13 models, 3–5 iterations per article per model.

Prompts folder also included: system prompt (with full sourcing vocabulary definitions) and user prompt (annotation instructions).

Direct Folder Links

News article corpus folder

Ground truth annotations folder

LLM-generated annotations folder

File Naming Convention

Articles are numbered 1–43. Every file uses a matching number prefix so article texts, ground truth CSVs, and LLM output CSVs can be cross-referenced directly.

Article text: story-11-Restoring_rights_felons.txt

Ground truth CSV: GT-11Nebraskafelonscrimjusticerev2.csv

LLM annotation CSV: 11Restoring_rights_felonsclaude3_7sonnetv2Sep06.csv

LLM annotation filenames also encode the model name and iteration version, enabling direct comparison across models for the same article.

Example Articles

Four articles used as annotation examples throughout the report:

Article 11: Restoring voting rights for former felons (Nebraska criminal justice reporting)

Article 25: Ballot access for transgender voters (anti-trans laws coverage)

Article 31: Vermont hair discrimination bill (state legislature reporting)

Article 32: OpenAI board restructuring (tech governance reporting)

What Each Annotated CSV Contains

Each row in a ground truth or LLM annotation CSV represents one sourced statement from the article and contains five columns:

Sourced Statement — The verbatim or paraphrased text attributed to a source in the story.

Type of Source — Category: Named Person, Named Organization, Document, Anonymous Source, Unnamed Person, or Unnamed Group of People.

Name of Source — Full name of the person or organization (where applicable).

Title of Source — Formal position, role, or expertise held by the source.

Source Justification — The reporter’s explanation of why this source is included—their connection to the story, lived experience, stakeholder status, or witnessed events.

Download a PDF of the LLM Performance Benchmark Report

LLM Journalism Sourcing GitHub