MultiNews: Multi-Document News Summarisation EDA
MultiNews is a large-scale multi-document news summarisation dataset sourced from newser.com. Each example pairs a human-written summary with the cluster of source news articles used to produce it. The dataset is widely used for summarisation, retrieval, and multi-document NLP research.
The data is loaded directly from HuggingFace - no local files needed.
HuggingFace dataset: Awesome075/multi_news_parquet
Original paper: Fabbri et al. (2019): Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model
| Parameter | Value |
|---|---|
| Source | Newser.com article clusters + editor summaries |
| Splits | train / validation / test |
| Document field | document (source articles separated by \|\|\|\|\|) |
| Summary field | summary |
| Task | Multi-document abstractive summarisation |
| Language | English |
Related papers
Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model
Fabbri, A.R., Li, I., She, T., Li, S., and Radev, D.R.
ACL 2019
https://aclanthology.org/P19-1102/
PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Xiao, W., Beltagy, I., Carenini, G., and Cohan, A.
ACL 2022
https://aclanthology.org/2022.acl-long.360/
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
Izacard, G. and Grave, E.
EACL 2021
https://aclanthology.org/2021.eacl-main.74/
pip install datasets
import re
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import warnings
import random
from collections import Counter
from IPython.display import display, HTML
from datasets import load_dataset # pip install datasets
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# pip install scikit-learn
warnings.filterwarnings(
'ignore',
message='.*unauthenticated requests to the HF Hub.*'
)
FL_BLUE = '#2563eb'
FL_SLATE = '#64748b'
FL_AMBER = '#f59e0b'
FL_GREEN = '#16a34a'
FL_RED = '#ef4444'
FL_BG = '#ffffff'
FL_GRID = '#e2e8f0'
FL_TEXT = '#0f172a'
FL_TEXT2 = '#334155'
FL_BORDER = '#e2e8f0'
matplotlib.rcParams.update({
'figure.facecolor': FL_BG,
'axes.facecolor': FL_BG,
'axes.edgecolor': FL_BORDER,
'axes.labelcolor': FL_TEXT2,
'axes.spines.top': False,
'axes.spines.right': False,
'axes.grid': True,
'grid.color': FL_GRID,
'grid.linewidth': 0.7,
'xtick.color': FL_TEXT2,
'ytick.color': FL_TEXT2,
'xtick.labelsize': 10,
'ytick.labelsize': 10,
'axes.labelsize': 11,
'axes.titlesize': 12,
'axes.titlecolor': FL_TEXT,
'axes.titlepad': 12,
'legend.frameon': False,
'legend.fontsize': 10,
'figure.dpi': 300,
'savefig.bbox': 'tight',
'font.family': 'sans-serif',
'font.sans-serif': ['Inter', 'Helvetica Neue', 'Arial', 'DejaVu Sans'],
})
DATASET_ID = 'Awesome075/multi_news_parquet'
SPLITS = ['train', 'validation', 'test']
DOC_SEP = '|||||' # separator between source articles
DOC_MIN_CHARS = 80 # minimum chars to count an article as usable
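def norm_ws(s):
    # Collapse whitespace runs to single spaces and trim both ends.
    return re.sub(r'\s+', ' ', (s or '').strip())
def split_sources(document_field):
    # Split the raw document field on DOC_SEP and drop empty segments.
    parts = (document_field or '').split(DOC_SEP)
    return [norm_ws(p) for p in parts if norm_ws(p)]
def get_summary(row):
    # Accept either a dict-like row or an object with a .summary attribute.
    if isinstance(row, dict):
        return norm_ws(row.get('summary', '') or '')
    return norm_ws(getattr(row, 'summary', '') or '')
def first_sentence(text):
    # Return text up to the first sentence terminator, or the first 200 chars.
    text = norm_ws(text)
    m = re.search(r'[.!?]\s', text)
    return text[:m.end()].strip() if m else text[:200]
def word_count(text):
    # Count alphanumeric tokens.
    return len(re.findall(r'[a-zA-Z0-9]+', text or ''))
def tok_content(text):
    # Lowercased alphanumeric tokens, minus stop words and pure numbers.
    stops = ENGLISH_STOP_WORDS
    toks = re.findall(r'[a-z0-9]+', (text or '').lower())
    return [t for t in toks if t not in stops and not t.isdigit()]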
print('Setup complete - loading dataset from HuggingFace...')
Setup complete - loading dataset from HuggingFace...
Loading the dataset
The dataset loads directly from HuggingFace. The document field contains the source articles joined by ||||| separators. Each document is split on that separator, and articles shorter than DOC_MIN_CHARS (80 characters) are filtered out where a usable source count per cluster is needed.
data = {}
for split in SPLITS:
    data[split] = load_dataset(DATASET_ID, split=split)
    print(f'{split:12} {len(data[split]):>7,} examples')
print(f'\nTotal: {sum(len(v) for v in data.values()):,} examples')
print(f'\nColumn names: {data["train"].column_names}')
# Show a single example structure
ex = data['train'][0]
sources = split_sources(ex['document'])
summary = get_summary(ex)
print(f'\nExample 0 - train split:')
print(f' Source articles: {len(sources)}')
print(f' Summary length: {len(summary)} chars / {word_count(summary)} words')
print(f' Summary preview: {summary[:200]}...')
train         44,972 examples
validation     5,622 examples
test           5,622 examples

Total: 56,216 examples

Column names: ['document', 'summary']

Example 0 - train split:
 Source articles: 2
 Summary length: 443 chars / 80 words
 Summary preview: – The unemployment rate dropped to 8.2% last month, but the economy only added 120,000 jobs, when 203,000 new jobs had been predicted, according to today's jobs report. Reaction on the Wall Street Jou...
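The splitting and length-filtering step can be illustrated on a toy string. This is a minimal sketch that re-declares the separator and the 80-character threshold from the setup cell; the `usable_sources` helper name and the `toy_document` string are made up for illustration and are not part of the dataset or the notebook.

```python
import re

DOC_SEP = '|||||'      # same separator as in the setup cell
DOC_MIN_CHARS = 80     # same usability threshold as in the setup cell


def usable_sources(document_field):
    """Split a raw document field into articles and drop very short ones."""
    parts = (document_field or '').split(DOC_SEP)
    # Collapse whitespace runs so lengths are comparable across articles.
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in parts]
    return [p for p in cleaned if len(p) >= DOC_MIN_CHARS]


# Toy input (hypothetical): one long article, one stub, one empty segment.
toy_document = (
    'The unemployment rate fell last month, but hiring slowed more than '
    'economists had predicted, according to the monthly jobs report.'
    ' ||||| Read more. ||||| '
)

articles = usable_sources(toy_document)
print(len(articles))  # 1: only the first segment reaches 80 characters
```

Filtering on characters rather than words keeps the check cheap; stubs such as "Read more." are dropped before any word counting takes place.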
Split statistics
Basic size and text length statistics for each split. Source article counts per cluster and summary lengths are the key structural properties of this dataset.
rows = []
train_source_counts = []
train_summary_words = []
train_source_words = []
train_usable_counts = []
for row in data['train']:
    # Rows are plain dicts, so the document field can be indexed directly.
    sources = split_sources(row['document'])
    usable = [s for s in sources if len(s) >= DOC_MIN_CHARS]
    summary = get_summary(row)
    train_source_counts.append(len(sources))
    train_usable_counts.append(len(usable))
    train_summary_words.append(word_count(summary))
    train_source_words.append(sum(word_count(s) for s in usable))
sc = np.array(train_source_counts)
uc = np.array(train_usable_counts)
sw = np.array(train_summary_words)
dw = np.array(train_source_words)
rows.append({
'Split': 'train',
'Examples': len(sc),
'Median sources': float(np.median(sc)),
'Mean sources': float(np.mean(sc)),
'Median usable': float(np.median(uc)),
'Median summ words': float(np.median(sw)),
'Mean summ words': float(np.mean(sw)),
'Median src words': float(np.median(dw)),
})
report_df = pd.DataFrame(rows)
report_df[['Median sources', 'Mean sources', 'Median usable', 'Median summ words', 'Mean summ words', 'Median src words']] = (
report_df[['Median sources', 'Mean sources', 'Median usable', 'Median summ words', 'Mean summ words', 'Median src words']]
.round(1)
)
display(report_df.T)
| | 0 |
|---|---|
| Split | train |
| Examples | 44972 |
| Median sources | 2.0 |
| Mean sources | 2.8 |
| Median usable | 2.0 |
| Median summ words | 228.0 |
| Mean summ words | 225.7 |
| Median src words | 1365.0 |
Source article count distribution
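How many source articles are bundled per cluster? In the train split the median is 2 and the mean is 2.8, so small clusters dominate and the distribution has a long right tail. Fabbri et al. (2019) report between 2 and 10 sources per summary.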
train_source_counts = []
for row in data['train']:
    sources = split_sources(row['document'])
    train_source_counts.append(len(sources))
cnt = Counter(train_source_counts)
max_show = 15
xs = list(range(1, max_show + 1))
ys = [cnt.get(x, 0) for x in xs]
plt.figure(figsize=(8, 4.5))
plt.bar(xs, ys, color=FL_BLUE, alpha=0.85, width=0.7)
plt.xlabel('Number of source articles')
plt.ylabel('Cluster count')
plt.title('Source article count per cluster - train split')
plt.xticks(xs)
plt.tick_params(length=0)
plt.gca().yaxis.set_major_formatter(
mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{int(y)}')
)
median_src = np.median(train_source_counts)
plt.axvline(
median_src,
color=FL_AMBER,
linewidth=1.5,
linestyle='--',
label=f'Median = {median_src:.0f}'
)
plt.legend(fontsize=9)
plt.tight_layout()
plt.show()
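The counts behind the bar chart can also be read as shares of all clusters. The following is a minimal sketch on made-up numbers: `toy_counts` is a hypothetical stand-in for `train_source_counts`, and the printed shares say nothing about the real dataset.

```python
from collections import Counter

# Hypothetical per-cluster source counts; in the notebook this would be
# train_source_counts.
toy_counts = [2, 2, 2, 3, 3, 4, 2, 5, 3, 2]

cnt = Counter(toy_counts)
total = sum(cnt.values())

cumulative = 0.0
share_table = []  # rows of (n_sources, clusters, share, cumulative share)
for n_sources in sorted(cnt):
    share = cnt[n_sources] / total
    cumulative += share
    share_table.append((n_sources, cnt[n_sources], share, cumulative))
    print(f'{n_sources:>2} sources: {cnt[n_sources]:>3} clusters '
          f'({share:6.1%}, cumulative {cumulative:6.1%})')
```

The cumulative column makes it easy to state, for example, what fraction of clusters has at most three sources.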
Summary and source text length distributions
Summary length (in words) and total source word count per cluster. Train summaries have a median of 228 words, against a median of 1,365 words of source text per cluster. Because the sources are several times longer than the summary, models must compress aggressively and select relevant content across multiple articles.
train_summary_words = []
train_source_words = []
for row in data['train']:
    sources = split_sources(row['document'])
    usable = [s for s in sources if len(s) >= DOC_MIN_CHARS]
    summary = get_summary(row)
    train_summary_words.append(word_count(summary))
    train_source_words.append(sum(word_count(s) for s in usable))
sw = np.array(train_summary_words)
dw = np.array(train_source_words)
plt.figure(figsize=(8, 4.5))
plt.hist(
np.clip(sw, 0, 600),
bins=40,
color=FL_BLUE,
alpha=0.85,
edgecolor='none'
)
plt.axvline(
np.median(sw),
color=FL_AMBER,
linewidth=1.5,
linestyle='--',
label=f'Median = {np.median(sw):.0f}'
)
plt.title('Summary words - train split')
plt.xlabel('Word count (clipped at 600)')
plt.ylabel('Frequency')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.hist(
np.clip(dw, 0, 5000),
bins=40,
color=FL_SLATE,
alpha=0.85,
edgecolor='none'
)
plt.axvline(
np.median(dw),
color=FL_AMBER,
linewidth=1.5,
linestyle='--',
label=f'Median = {np.median(dw):.0f}'
)
plt.title('Total source words - train split')
plt.xlabel('Word count (clipped at 5,000)')
plt.ylabel('Frequency')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
Compression ratio
Compression ratio = summary word count / total source word count. A value near 0 means extreme compression. This is one of the defining characteristics of MultiNews as a benchmark: models must identify and compress the most relevant content from long, redundant multi-source input.
sw = np.array(train_summary_words, dtype=float)
dw = np.array(train_source_words, dtype=float)
# Divide only where the source word count is positive; clusters with no usable
# source text keep a ratio of 0 and no divide-by-zero warning is raised.
ratio = np.divide(sw, dw, out=np.zeros_like(sw), where=dw > 0)
ratio_clipped = np.clip(ratio, 0, 1)
plt.figure(figsize=(8, 4.5))
plt.hist(
ratio_clipped,
bins=40,
color=FL_GREEN,
alpha=0.85,
edgecolor='none'
)
plt.axvline(
float(np.median(ratio_clipped)),
color=FL_AMBER,
linewidth=1.5,
linestyle='--',
label=f'Median = {np.median(ratio_clipped):.3f}'
)
plt.title('Compression ratio - train split')
plt.xlabel('Summary / source words')
plt.ylabel('Frequency')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
print(
f'train median ratio: {np.median(ratio):.4f} '
f'mean: {np.mean(ratio):.4f} '
f'p10: {np.percentile(ratio, 10):.4f} '
f'p90: {np.percentile(ratio, 90):.4f}'
)
train median ratio: 0.1582 mean: 0.2114 p10: 0.0676 p90: 0.3775
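As a worked example of the ratio, the sketch below computes it for four hypothetical clusters, one of which has no usable source text. The word counts are invented for illustration. Guarding the division with `where=` is one way to handle empty inputs; other choices (for example dropping such clusters) are equally defensible.

```python
import numpy as np

# Hypothetical word counts for four clusters; the last has no usable sources.
summary_words = np.array([200, 150, 90, 120], dtype=float)
source_words = np.array([2000, 600, 900, 0], dtype=float)

# Divide only where the denominator is positive; other entries keep the
# value already stored in `out` (0.0), so no warning is raised.
ratio = np.divide(
    summary_words,
    source_words,
    out=np.zeros_like(summary_words),
    where=source_words > 0,
)

print(ratio)              # ratios 0.1, 0.25, 0.1 and 0.0 for the empty cluster
print(np.median(ratio))
```

A ratio of 0.1 means the summary is one tenth the length of its combined sources; the 0.0 entry is a placeholder for the empty cluster, not a measured compression.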
Vocabulary overlap: sources vs summary
Lexical coverage is the fraction of the summary's content words (stop words and pure numbers removed) that appear somewhere in the source documents. High coverage indicates that the summaries are largely extractive in vocabulary even if abstractive in structure. Redundancy measures how much the source articles overlap each other: the Jaccard similarity of content-token sets, averaged over all source pairs in a cluster. Both are computed on a random sample of up to 500 train clusters.
rng = random.Random(42)
sample_indices = rng.sample(range(len(data['train'])), k=min(500, len(data['train'])))
coverages = []
redundancies = []
for idx in sample_indices:
    row = data['train'][idx]
    sources = [
        s for s in split_sources(row['document'])
        if len(s) >= DOC_MIN_CHARS
    ]
    summary = get_summary(row)
    if not sources or not summary:
        continue
    ref_toks = set(tok_content(summary))
    src_union = set()
    src_tok_sets = []
    for s in sources:
        t = set(tok_content(s))
        src_union |= t
        src_tok_sets.append(t)
    # Coverage: share of summary content tokens found in any source.
    cov = len(src_union & ref_toks) / max(1, len(ref_toks)) if ref_toks else 0
    coverages.append(cov)
    # Redundancy: mean Jaccard similarity over all unordered source pairs.
    pairs = [
        (src_tok_sets[i], src_tok_sets[j])
        for i in range(len(src_tok_sets))
        for j in range(i + 1, len(src_tok_sets))
    ]
    if pairs:
        jac = np.mean([len(a & b) / max(1, len(a | b)) for a, b in pairs])
        redundancies.append(float(jac))
plt.figure(figsize=(8, 4.5))
plt.hist(
coverages,
bins=30,
color=FL_BLUE,
alpha=0.85,
edgecolor='none'
)
plt.axvline(
np.median(coverages),
color=FL_AMBER,
linewidth=1.5,
linestyle='--',
label=f'Median = {np.median(coverages):.3f}'
)
plt.xlabel('Coverage (summary vocab ∩ source vocab / summary vocab)')
plt.ylabel('Count')
plt.title('Lexical coverage - how much of the summary vocab appears in the source docs')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
plt.figure(figsize=(8, 4.5))
plt.hist(
redundancies,
bins=30,
color=FL_SLATE,
alpha=0.85,
edgecolor='none'
)
plt.axvline(
np.median(redundancies),
color=FL_AMBER,
linewidth=1.5,
linestyle='--',
label=f'Median = {np.median(redundancies):.3f}'
)
plt.xlabel('Pairwise Jaccard similarity (content tokens)')
plt.ylabel('Count')
plt.title('Source redundancy - average pairwise overlap between source articles')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()
print(f'Lexical coverage - median: {np.median(coverages):.3f} mean: {np.mean(coverages):.3f}')
print(f'Source redundancy - median: {np.median(redundancies):.3f} mean: {np.mean(redundancies):.3f}')
Lexical coverage - median: 0.698 mean: 0.678
Source redundancy - median: 0.130 mean: 0.145
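Both metrics are easy to check by hand on a single toy cluster. The token sets below are hypothetical and only loosely inspired by the walkthrough example; this is a sketch of the two formulas, not a reproduction of the notebook's numbers.

```python
from itertools import combinations

# Hypothetical content-token sets for one cluster.
summary_tokens = {'stern', 'ice', 'bucket', 'challenge', 'kasem'}
source_token_sets = [
    {'stern', 'ice', 'bucket', 'challenge', 'video', 'youtube'},
    {'stern', 'ice', 'bucket', 'aniston', 'lauer'},
]

# Coverage: share of summary tokens found in the union of all sources.
source_union = set().union(*source_token_sets)
coverage = len(summary_tokens & source_union) / max(1, len(summary_tokens))

# Redundancy: mean Jaccard similarity over all unordered source pairs.
pair_scores = [
    len(a & b) / max(1, len(a | b))
    for a, b in combinations(source_token_sets, 2)
]
# With fewer than two sources there are no pairs; this sketch falls back to
# 0.0, whereas the notebook simply skips such clusters.
redundancy = sum(pair_scores) / len(pair_scores) if pair_scores else 0.0

print(f'coverage={coverage:.3f} redundancy={redundancy:.3f}')
# coverage=0.800 redundancy=0.375
```

Here four of the five summary tokens occur in the sources (4/5 = 0.8), and the two sources share 3 of their 8 distinct tokens (3/8 = 0.375).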
Top vocabulary: sources vs summaries
The 20 most frequent content words in source documents and in summaries, counted over the first 1,000 train examples.
from collections import Counter
src_vocab = Counter()
summ_vocab = Counter()
sample_n = min(1000, len(data['train']))
for idx in range(sample_n):
    row = data['train'][idx]
    sources = [
        s for s in split_sources(row['document'])
        if len(s) >= DOC_MIN_CHARS
    ]
    summary = get_summary(row)
    for s in sources:
        src_vocab.update(tok_content(s))
    summ_vocab.update(tok_content(summary))
top_src = src_vocab.most_common(20)
top_summ = summ_vocab.most_common(20)
words_s, counts_s = zip(*top_src)
plt.figure(figsize=(8, 4.5))
plt.barh(
list(reversed(words_s)),
list(reversed(counts_s)),
color=FL_BLUE,
alpha=0.85,
height=0.65
)
plt.xlabel('Term frequency')
plt.title(f'Top 20 content words from source articles (n={sample_n:,})')
plt.tick_params(length=0)
plt.gca().xaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'{x/1000:.0f}k' if x >= 1000 else f'{int(x)}')
)
plt.tight_layout()
plt.show()
words_u, counts_u = zip(*top_summ)
plt.figure(figsize=(8, 4.5))
plt.barh(
list(reversed(words_u)),
list(reversed(counts_u)),
color=FL_SLATE,
alpha=0.85,
height=0.65
)
plt.xlabel('Term frequency')
plt.title(f'Top 20 content words from summaries (n={sample_n:,})')
plt.tick_params(length=0)
plt.gca().xaxis.set_major_formatter(
mticker.FuncFormatter(lambda x, _: f'{x/1000:.0f}k' if x >= 1000 else f'{int(x)}')
)
plt.tight_layout()
plt.show()
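Raw term frequencies in the two charts are not directly comparable, because the sources contain far more text than the summaries. One possible way to compare them is to normalise each counter to relative frequencies first. The sketch below does this on invented counts; `toy_src_vocab`, `toy_summ_vocab`, and `relative_freq` are hypothetical names, not part of the notebook.

```python
from collections import Counter

# Hypothetical term counts standing in for src_vocab and summ_vocab.
toy_src_vocab = Counter({'said': 40, 'police': 20, 'new': 15, 'year': 25})
toy_summ_vocab = Counter({'said': 4, 'police': 6, 'new': 5, 'reports': 5})


def relative_freq(counter):
    """Convert raw counts to shares of all counted tokens."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}


src_rel = relative_freq(toy_src_vocab)
summ_rel = relative_freq(toy_summ_vocab)

# Positive values: relatively more frequent in summaries than in sources.
diff = {
    word: summ_rel.get(word, 0.0) - src_rel.get(word, 0.0)
    for word in set(src_rel) | set(summ_rel)
}

for word, delta in sorted(diff.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{word:10} {delta:+.3f}')
```

Words with a large positive difference are over-represented in summaries relative to sources, which can hint at summary-specific phrasing.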
Cluster example walkthrough
A single cluster illustrates the full structure: multiple source articles reporting on the same event from different angles, paired with a concise human-written summary.
from IPython.display import display, Markdown
IDX = 10
row = data['train'][IDX]
sources = [
    s for s in split_sources(row['document'])
    if len(s) >= DOC_MIN_CHARS
]
summary = get_summary(row)
display(Markdown('## Summary'))
display(Markdown(summary))
# Render each usable source article with its word count.
for i, src in enumerate(sources, start=1):
    display(Markdown(f'## Article {i}'))
    display(Markdown(f'Words: {word_count(src)}'))
    display(Markdown(src))
Summary
– Howard Stern has nominated an interesting trio to complete the ice bucket challenge after him. Trouble is, one of them is dead, reports E! Online. In a bizarre video posted to YouTube yesterday, a bare-chested Stern plops an ice cube into a shot glass of water, then pours it over his head, immediately succumbing to mock shivers before hopping off camera. But first he challenges the very much alive Barbara Walters and Mark Consuelos, plus the quite dead Casey Kasem, to douse themselves with ice water next. Stern is aware of Kasem's death at age 82, E! notes, as he's discussed it on his radio show.
Article 1
Words: 86
WTF?! Howard Stern recently completed the ALS Ice Bucket Challenge and shared a video of the do-gooder act on YouTube. While Stern doing the bone-chilling charitable act is nothing out of the ordinary, you may be scratching your head when you hear who he nominates to undertake the challenge next. "Hey everybody, it's Howard Stern ready to take the Ice Bucket Challenge," a shirtless Stern says in the video. "I'm accepting the challenge of...who challenged me? Matt Lauer and Jennifer Aniston."
Article 2
Words: 56
After both Jennifer Aniston and Matt Lauer nominated him, Howard Stern finally accepted the Ice Bucket Challenge - and you won't believe who he nominated! Remember, all this ice bucket nonsense is for a good cause - donate to the ALS Association and help Strike Out ALS! And watch Beth Stern take the ice bucket challenge too!