MultiNews: Multi-Document News Summarisation EDA

MultiNews is a large-scale multi-document news summarisation dataset sourced from newser.com. Each example pairs a human-written summary with the cluster of source news articles used to produce it. The dataset is widely used for summarisation, retrieval, and multi-document NLP research.

The data is loaded directly from HuggingFace - no local files needed.

HuggingFace dataset: Awesome075/multi_news_parquet
Original paper: Fabbri et al. (2019): Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

Parameter	Value
Source	Newser.com article clusters + editor summaries
Splits	train / validation / test
Document field	`document` (source articles separated by `|||||`)
Summary field	`summary`
Task	Multi-document abstractive summarisation
Language	English

Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model
Fabbri, A.R., Li, I., She, T., Li, S., and Radev, D.R.
ACL 2019
https://aclanthology.org/P19-1102/

PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Xiao, W., Beltagy, I., Carenini, G., and Cohan, A.
ACL 2022
https://aclanthology.org/2022.acl-long.360/

Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
Izacard, G. and Grave, E.
EACL 2021
https://aclanthology.org/2021.eacl-main.74/

In [ ]:

%pip install datasets scikit-learn

In [4]:

import re
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import warnings
import random

from collections import Counter
from IPython.display import display, HTML
from datasets import load_dataset  # pip install datasets
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # pip install scikit-learn

warnings.filterwarnings(
    'ignore',
    message='.*unauthenticated requests to the HF Hub.*'
)

FL_BLUE   = '#2563eb'
FL_SLATE  = '#64748b'
FL_AMBER  = '#f59e0b'
FL_GREEN  = '#16a34a'
FL_RED    = '#ef4444'
FL_BG     = '#ffffff'
FL_GRID   = '#e2e8f0'
FL_TEXT   = '#0f172a'
FL_TEXT2  = '#334155'
FL_BORDER = '#e2e8f0'

matplotlib.rcParams.update({
    'figure.facecolor':  FL_BG,
    'axes.facecolor':    FL_BG,
    'axes.edgecolor':    FL_BORDER,
    'axes.labelcolor':   FL_TEXT2,
    'axes.spines.top':   False,
    'axes.spines.right': False,
    'axes.grid':         True,
    'grid.color':        FL_GRID,
    'grid.linewidth':    0.7,
    'xtick.color':       FL_TEXT2,
    'ytick.color':       FL_TEXT2,
    'xtick.labelsize':   10,
    'ytick.labelsize':   10,
    'axes.labelsize':    11,
    'axes.titlesize':    12,
    'axes.titlecolor':   FL_TEXT,
    'axes.titlepad':     12,
    'legend.frameon':    False,
    'legend.fontsize':   10,
    'figure.dpi':        300,
    'savefig.bbox':      'tight',
    'font.family':       'sans-serif',
    'font.sans-serif':   ['Inter', 'Helvetica Neue', 'Arial', 'DejaVu Sans'],
})

DATASET_ID   = 'Awesome075/multi_news_parquet'
SPLITS       = ['train', 'validation', 'test']
DOC_SEP      = '|||||'          # separator between source articles
DOC_MIN_CHARS = 80              # minimum chars to count an article as usable

def norm_ws(s):
    return re.sub(r'\s+', ' ', (s or '').strip())

def split_sources(document_field):
    parts = (document_field or '').split(DOC_SEP)
    return [norm_ws(p) for p in parts if norm_ws(p)]

def get_summary(row):
    if isinstance(row, dict):
        return norm_ws(row.get('summary', '') or '')
    return norm_ws(getattr(row, 'summary', '') or '')

def first_sentence(text):
    text = norm_ws(text)
    m = re.search(r'[.!?]\s', text)
    return text[:m.end()].strip() if m else text[:200]

def word_count(text):
    return len(re.findall(r'[a-zA-Z0-9]+', text or ''))

def tok_content(text):
    stops = ENGLISH_STOP_WORDS
    toks = re.findall(r'[a-z0-9]+', (text or '').lower())
    return [t for t in toks if t not in stops and not t.isdigit()]


print('Setup complete - loading dataset from HuggingFace...')

Setup complete - loading dataset from HuggingFace...

Loading the dataset

The dataset loads directly from HuggingFace. The document field contains source articles joined by ||||| separators. Each article is split and length-filtered to produce a clean source count per cluster.
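To make the split-and-filter step concrete, here is a minimal sketch on a made-up string. The helper name `split_and_filter` and the toy input are illustrative assumptions; the separator and the 80-character threshold mirror `DOC_SEP` and `DOC_MIN_CHARS` from the setup cell.

```python
import re

DOC_SEP = "|||||"   # separator used inside the `document` field
MIN_CHARS = 80      # mirrors DOC_MIN_CHARS from the setup cell

def split_and_filter(document, sep=DOC_SEP, min_chars=MIN_CHARS):
    """Split a raw document field into articles and keep the usable ones."""
    # Collapse whitespace runs in each fragment, then drop empty fragments.
    parts = [re.sub(r"\s+", " ", p).strip() for p in document.split(sep)]
    parts = [p for p in parts if p]
    # An article counts as usable only if it reaches the length threshold.
    usable = [p for p in parts if len(p) >= min_chars]
    return parts, usable

# Toy input (made up): two long articles, one short stub, one empty trailing segment.
toy = (
    "First article. " * 10
    + DOC_SEP
    + "Stub."
    + DOC_SEP
    + "Second   article\nwith odd spacing. " * 5
    + DOC_SEP
)
parts, usable = split_and_filter(toy)
print(len(parts), len(usable))
```

The empty trailing segment is dropped at the split stage, while the short stub survives splitting but is excluded from the usable count.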

In [5]:

data = {}
for split in SPLITS:
    data[split] = load_dataset(DATASET_ID, split=split)
    print(f'{split:12} {len(data[split]):>7,} examples')

print(f'\nTotal: {sum(len(v) for v in data.values()):,} examples')
print(f'\nColumn names: {data["train"].column_names}')

# Show a single example structure
ex = data['train'][0]
sources = split_sources(ex['document'])
summary = get_summary(ex)
print(f'\nExample 0 - train split:')
print(f'  Source articles:  {len(sources)}')
print(f'  Summary length:   {len(summary)} chars / {word_count(summary)} words')
print(f'  Summary preview:  {summary[:200]}...')

Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.

train         44,972 examples
validation     5,622 examples
test           5,622 examples

Total: 56,216 examples

Column names: ['document', 'summary']

Example 0 - train split:
  Source articles:  2
  Summary length:   443 chars / 80 words
  Summary preview:  – The unemployment rate dropped to 8.2% last month, but the economy only added 120,000 jobs, when 203,000 new jobs had been predicted, according to today's jobs report. Reaction on the Wall Street Jou...

Split statistics

Basic size and text-length statistics for the train split. Source article counts per cluster and summary lengths are the key structural properties of this dataset.
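The cell that follows reports the train split only. If the same statistics are wanted for validation and test, the loop could be wrapped in a helper along these lines. This is a sketch under assumptions: `split_stats` is a hypothetical name, it uses the standard-library `statistics` module rather than NumPy, and it runs on toy rows rather than the real dataset.

```python
import re
from statistics import mean, median

DOC_SEP = "|||||"  # separator between source articles

def split_stats(rows, min_chars=80):
    """Summarise an iterable of {'document', 'summary'} rows (hypothetical helper)."""
    usable_counts, summary_words = [], []
    for row in rows:
        parts = [p.strip() for p in row["document"].split(DOC_SEP)]
        # Keep only articles long enough to count as usable.
        usable = [p for p in parts if len(p) >= min_chars]
        usable_counts.append(len(usable))
        summary_words.append(len(re.findall(r"[a-zA-Z0-9]+", row["summary"])))
    return {
        "examples": len(usable_counts),
        "median_usable": median(usable_counts),
        "mean_summ_words": mean(summary_words),
    }

# Toy rows (made up) standing in for one split.
toy_split = [
    {"document": "a" * 100 + DOC_SEP + "b" * 100, "summary": "one two three"},
    {
        "document": "c" * 90 + DOC_SEP + "stub" + DOC_SEP + "d" * 90 + DOC_SEP + "e" * 90,
        "summary": "one two three four five",
    },
]
stats = split_stats(toy_split)
print(stats)
```

In the notebook, calling the helper once per entry in `SPLITS` and collecting the dictionaries into a DataFrame would presumably give one row per split.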

In [6]:

rows = []

train_source_counts = []
train_summary_words = []
train_source_words = []
train_usable_counts = []

for row in data['train']:
    sources = split_sources(row['document'])
    usable = [s for s in sources if len(s) >= DOC_MIN_CHARS]
    summary = get_summary(row)

    train_source_counts.append(len(sources))
    train_usable_counts.append(len(usable))
    train_summary_words.append(word_count(summary))
    train_source_words.append(sum(word_count(s) for s in usable))

sc = np.array(train_source_counts)
uc = np.array(train_usable_counts)
sw = np.array(train_summary_words)
dw = np.array(train_source_words)

rows.append({
    'Split': 'train',
    'Examples': len(sc),
    'Median sources': float(np.median(sc)),
    'Mean sources': float(np.mean(sc)),
    'Median usable': float(np.median(uc)),
    'Median summ words': float(np.median(sw)),
    'Mean summ words': float(np.mean(sw)),
    'Median src words': float(np.median(dw)),
})

report_df = pd.DataFrame(rows)
report_df[['Median sources', 'Mean sources', 'Median usable', 'Median summ words', 'Mean summ words', 'Median src words']] = (
    report_df[['Median sources', 'Mean sources', 'Median usable', 'Median summ words', 'Mean summ words', 'Median src words']]
    .round(1)
)

display(report_df.T)

	0
Split	train
Examples	44972
Median sources	2.0
Mean sources	2.8
Median usable	2.0
Median summ words	228.0
Mean summ words	225.7
Median src words	1365.0

Source article count distribution

How many source articles are bundled per cluster? Most clusters contain 2-6 articles, and the largest contain up to 10.

In [7]:

train_source_counts = []

for row in data['train']:
    sources = split_sources(row['document'])
    train_source_counts.append(len(sources))

cnt = Counter(train_source_counts)
max_show = 15
xs = list(range(1, max_show + 1))
ys = [cnt.get(x, 0) for x in xs]

plt.figure(figsize=(8, 4.5))
plt.bar(xs, ys, color=FL_BLUE, alpha=0.85, width=0.7)
plt.xlabel('Number of source articles')
plt.ylabel('Cluster count')
plt.title('Source article count per cluster - train split')
plt.xticks(xs)
plt.tick_params(length=0)
plt.gca().yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda y, _: f'{y/1000:.0f}k' if y >= 1000 else f'{int(y)}')
)

median_src = np.median(train_source_counts)
plt.axvline(
    median_src,
    color=FL_AMBER,
    linewidth=1.5,
    linestyle='--',
    label=f'Median = {median_src:.0f}'
)
plt.legend(fontsize=9)
plt.tight_layout()
plt.show()


Summary and source text length distributions

Summary length (in words) and total source word count per cluster. In the train split the median summary is 228 words against a median of 1,365 source words per cluster, so models must compress aggressively and select relevant content across multiple articles.

In [8]:

train_summary_words = []
train_source_words = []

for row in data['train']:
    sources = split_sources(row['document'])
    usable = [s for s in sources if len(s) >= DOC_MIN_CHARS]
    summary = get_summary(row)

    train_summary_words.append(word_count(summary))
    train_source_words.append(sum(word_count(s) for s in usable))

sw = np.array(train_summary_words)
dw = np.array(train_source_words)

plt.figure(figsize=(8, 4.5))
plt.hist(
    np.clip(sw, 0, 600),
    bins=40,
    color=FL_BLUE,
    alpha=0.85,
    edgecolor='none'
)
plt.axvline(
    np.median(sw),
    color=FL_AMBER,
    linewidth=1.5,
    linestyle='--',
    label=f'Median = {np.median(sw):.0f}'
)
plt.title('Summary words - train split')
plt.xlabel('Word count (clipped at 600)')
plt.ylabel('Frequency')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.hist(
    np.clip(dw, 0, 5000),
    bins=40,
    color=FL_SLATE,
    alpha=0.85,
    edgecolor='none'
)
plt.axvline(
    np.median(dw),
    color=FL_AMBER,
    linewidth=1.5,
    linestyle='--',
    label=f'Median = {np.median(dw):.0f}'
)
plt.title('Total source words - train split')
plt.xlabel('Word count (clipped at 5,000)')
plt.ylabel('Frequency')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

Compression ratio

Compression ratio = summary word count / total source word count. A value near 0 means extreme compression. This is one of the defining characteristics of MultiNews as a benchmark: models must identify and compress the most relevant content from long, redundant multi-source input.
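As a small worked example of the ratio, the sketch below uses made-up word counts, including one cluster with no usable source words. Using `np.divide` with `where=` is one way to guard the zero case; it is shown here as an illustration, and the array names are assumptions.

```python
import numpy as np

# Made-up word counts for four clusters; the third has no usable source text.
summary_words = np.array([200.0, 150.0, 90.0, 40.0])
source_words = np.array([2000.0, 600.0, 0.0, 400.0])

# Divide only where the denominator is positive; other positions keep the 0 from `out`.
ratio = np.divide(
    summary_words,
    source_words,
    out=np.zeros_like(summary_words),
    where=source_words > 0,
)
print(ratio)
```

A ratio of 0.1 means the summary is one tenth the length of its combined sources; the empty-source cluster is reported as 0 rather than as infinity.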

In [9]:

sw = np.array(train_summary_words, dtype=float)
dw = np.array(train_source_words, dtype=float)

# Safe division: clusters with no usable source words get a ratio of 0
# without triggering a divide-by-zero RuntimeWarning.
ratio = np.divide(sw, dw, out=np.zeros_like(sw), where=dw > 0)
ratio_clipped = np.clip(ratio, 0, 1)

plt.figure(figsize=(8, 4.5))
plt.hist(
    ratio_clipped,
    bins=40,
    color=FL_GREEN,
    alpha=0.85,
    edgecolor='none'
)
plt.axvline(
    float(np.median(ratio_clipped)),
    color=FL_AMBER,
    linewidth=1.5,
    linestyle='--',
    label=f'Median = {np.median(ratio_clipped):.3f}'
)
plt.title('Compression ratio - train split')
plt.xlabel('Summary / source words')
plt.ylabel('Frequency')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

print(
    f'train        median ratio: {np.median(ratio):.4f}  '
    f'mean: {np.mean(ratio):.4f}  '
    f'p10: {np.percentile(ratio, 10):.4f}  '
    f'p90: {np.percentile(ratio, 90):.4f}'
)

train        median ratio: 0.1582  mean: 0.2114  p10: 0.0676  p90: 0.3775

Vocabulary overlap: sources vs summary

Lexical coverage: what fraction of the summary's content words appear somewhere in the source documents? High coverage indicates that the summaries are largely extractive in vocabulary even if abstractive in structure. Redundancy measures how much the source articles overlap each other: the Jaccard similarity of content-token sets, averaged over all pairs of articles in a cluster. Both metrics are estimated on a random sample of up to 500 train clusters.
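Both metrics can be checked by hand on a tiny example. The sketch below assumes tokenisation and stop-word removal have already happened; the function names and the toy token sets are illustrative, not part of the notebook.

```python
from itertools import combinations

def coverage(summary_tokens, source_token_sets):
    """Fraction of distinct summary tokens found in at least one source."""
    summary_vocab = set(summary_tokens)
    if not summary_vocab:
        return 0.0
    source_vocab = set().union(*source_token_sets)
    return len(summary_vocab & source_vocab) / len(summary_vocab)

def mean_pairwise_jaccard(source_token_sets):
    """Average Jaccard similarity over all unordered pairs of sources."""
    pairs = list(combinations(source_token_sets, 2))
    if not pairs:
        return None  # a single source has no pair to compare
    return sum(len(a & b) / max(1, len(a | b)) for a, b in pairs) / len(pairs)

# Toy cluster (made up): two sources and one summary.
toy_sources = [
    {"stern", "ice", "bucket", "challenge"},
    {"stern", "ice", "als", "donate"},
]
toy_summary = ["stern", "ice", "bucket", "kasem"]

cov = coverage(toy_summary, toy_sources)
jac = mean_pairwise_jaccard(toy_sources)
print(cov, jac)
```

Here three of the four summary words occur in the sources, giving a coverage of 0.75, and the two sources share two of six distinct words, giving a Jaccard similarity of one third.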

In [10]:

rng = random.Random(42)
sample_indices = rng.sample(range(len(data['train'])), k=min(500, len(data['train'])))

coverages = []
redundancies = []

for idx in sample_indices:
    row = data['train'][idx]
    sources = [
        s for s in split_sources(row['document'])
        if len(s) >= DOC_MIN_CHARS
    ]
    summary = get_summary(row)

    if not sources or not summary:
        continue

    ref_toks = set(tok_content(summary))
    src_union = set()
    src_tok_sets = []

    for s in sources:
        t = set(tok_content(s))
        src_union |= t
        src_tok_sets.append(t)

    cov = len(src_union & ref_toks) / max(1, len(ref_toks)) if ref_toks else 0
    coverages.append(cov)

    pairs = [
        (src_tok_sets[i], src_tok_sets[j])
        for i in range(len(src_tok_sets))
        for j in range(i + 1, len(src_tok_sets))
    ]

    if pairs:
        jac = np.mean([len(a & b) / max(1, len(a | b)) for a, b in pairs])
        redundancies.append(float(jac))

plt.figure(figsize=(8, 4.5))
plt.hist(
    coverages,
    bins=30,
    color=FL_BLUE,
    alpha=0.85,
    edgecolor='none'
)
plt.axvline(
    np.median(coverages),
    color=FL_AMBER,
    linewidth=1.5,
    linestyle='--',
    label=f'Median = {np.median(coverages):.3f}'
)
plt.xlabel('Coverage (summary vocab ∩ source vocab / summary vocab)')
plt.ylabel('Count')
plt.title('Lexical coverage - how much of the summary vocab appears in the source docs')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

plt.figure(figsize=(8, 4.5))
plt.hist(
    redundancies,
    bins=30,
    color=FL_SLATE,
    alpha=0.85,
    edgecolor='none'
)
plt.axvline(
    np.median(redundancies),
    color=FL_AMBER,
    linewidth=1.5,
    linestyle='--',
    label=f'Median = {np.median(redundancies):.3f}'
)
plt.xlabel('Pairwise Jaccard similarity (content tokens)')
plt.ylabel('Count')
plt.title('Source redundancy - average pairwise overlap between source articles')
plt.legend(fontsize=9)
plt.tick_params(length=0)
plt.tight_layout()
plt.show()

print(f'Lexical coverage   - median: {np.median(coverages):.3f}  mean: {np.mean(coverages):.3f}')
print(f'Source redundancy  - median: {np.median(redundancies):.3f}  mean: {np.mean(redundancies):.3f}')

Lexical coverage   - median: 0.698  mean: 0.678
Source redundancy  - median: 0.130  mean: 0.145

Top vocabulary: sources vs summaries

The most frequent content words in source documents compared to summaries, counted over the first 1,000 train examples.
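Raw counts from sources and summaries are not directly comparable, because the sources contain far more text. One option, sketched here as an assumption rather than something the notebook does, is to convert counts to relative frequencies before comparing. The stop-word set below is a tiny illustrative stand-in for `ENGLISH_STOP_WORDS`, and the sentences are made up.

```python
import re
from collections import Counter

STOPS = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative stop list

def content_tokens(text):
    """Lowercase alphanumeric tokens, minus stop words and pure numbers."""
    toks = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in toks if t not in STOPS and not t.isdigit()]

def relative_freq(texts):
    """Map each content word to its share of all content tokens in `texts`."""
    counts = Counter()
    for text in texts:
        counts.update(content_tokens(text))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

# Toy texts (made up).
src_freq = relative_freq([
    "The police said the suspect fled.",
    "Police arrested the suspect in 2014.",
])
summ_freq = relative_freq(["Police arrested a suspect."])
print(src_freq["police"], summ_freq["police"])
```

Words whose share is much higher in summaries than in sources hint at vocabulary the summary writers introduce themselves.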

In [11]:

from collections import Counter

src_vocab = Counter()
summ_vocab = Counter()

sample_n = min(1000, len(data['train']))
for idx in range(sample_n):
    row = data['train'][idx]
    sources = [
        s for s in split_sources(row['document'])
        if len(s) >= DOC_MIN_CHARS
    ]
    summary = get_summary(row)

    for s in sources:
        src_vocab.update(tok_content(s))

    summ_vocab.update(tok_content(summary))

top_src = src_vocab.most_common(20)
top_summ = summ_vocab.most_common(20)

words_s, counts_s = zip(*top_src)

plt.figure(figsize=(8, 4.5))
plt.barh(
    list(reversed(words_s)),
    list(reversed(counts_s)),
    color=FL_BLUE,
    alpha=0.85,
    height=0.65
)
plt.xlabel('Term frequency')
plt.title(f'Top 20 content words from source articles (n={sample_n:,})')
plt.tick_params(length=0)
plt.gca().xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'{x/1000:.0f}k' if x >= 1000 else f'{int(x)}')
)
plt.tight_layout()
plt.show()

words_u, counts_u = zip(*top_summ)

plt.figure(figsize=(8, 4.5))
plt.barh(
    list(reversed(words_u)),
    list(reversed(counts_u)),
    color=FL_SLATE,
    alpha=0.85,
    height=0.65
)
plt.xlabel('Term frequency')
plt.title(f'Top 20 content words from summaries (n={sample_n:,})')
plt.tick_params(length=0)
plt.gca().xaxis.set_major_formatter(
    mticker.FuncFormatter(lambda x, _: f'{x/1000:.0f}k' if x >= 1000 else f'{int(x)}')
)
plt.tight_layout()
plt.show()

Cluster example walkthrough

A single cluster illustrates the full structure: multiple source articles reporting on the same event from different angles, paired with a concise human-written summary.

In [12]:

from IPython.display import display, Markdown

IDX = 10
row = data['train'][IDX]

sources = [
    s for s in split_sources(row['document'])
    if len(s) >= DOC_MIN_CHARS
]
summary = get_summary(row)

display(Markdown('## Summary'))
display(Markdown(summary))

for i, src in enumerate(sources, start=1):
    display(Markdown(f'## Article {i}'))
    display(Markdown(f'Words: {word_count(src)}'))
    display(Markdown(src))

Summary

– Howard Stern has nominated an interesting trio to complete the ice bucket challenge after him. Trouble is, one of them is dead, reports E! Online. In a bizarre video posted to YouTube yesterday, a bare-chested Stern plops an ice cube into a shot glass of water, then pours it over his head, immediately succumbing to mock shivers before hopping off camera. But first he challenges the very much alive Barbara Walters and Mark Consuelos, plus the quite dead Casey Kasem, to douse themselves with ice water next. Stern is aware of Kasem's death at age 82, E! notes, as he's discussed it on his radio show.

Article 1

Words: 86

WTF?! Howard Stern recently completed the ALS Ice Bucket Challenge and shared a video of the do-gooder act on YouTube. While Stern doing the bone-chilling charitable act is nothing out of the ordinary, you may be scratching your head when you hear who he nominates to undertake the challenge next. "Hey everybody, it's Howard Stern ready to take the Ice Bucket Challenge," a shirtless Stern says in the video. "I'm accepting the challenge of...who challenged me? Matt Lauer and Jennifer Aniston."

Article 2

Words: 56

After both Jennifer Aniston and Matt Lauer nominated him, Howard Stern finally accepted the Ice Bucket Challenge - and you won't believe who he nominated! Remember, all this ice bucket nonsense is for a good cause - donate to the ALS Association and help Strike Out ALS! And watch Beth Stern take the ice bucket challenge too!