Article Fetching and Text Extraction Reference
This notebook is a practical reference for collecting readable article text from URLs. It separates the network step from the parsing step so each part can be tested and debugged independently.
Use this notebook when you need to:
- download HTML with requests
- inspect HTTP status codes and response metadata
- extract readable article text with trafilatura
- extract article metadata such as title, author, date, and hostname
- batch process URLs into a clean pandas table
- keep success and failure reasons visible while debugging
In [ ]:
# pip install requests trafilatura pandas
In [ ]:
import time
from pathlib import Path
from urllib.parse import urlparse
import pandas as pd
import requests
import trafilatura
OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
USER_AGENT = "AcademicResearchBot/0.1 (+contact@example.com)"
TIMEOUT_SECONDS = 10
REQUEST_DELAY_SECONDS = 1.0
RUN_LIVE_REQUESTS = False
Role separation
A stable article extraction workflow has two separate jobs.
| Step | Tool | Job |
|---|---|---|
| Fetch HTML | requests | Connect to the URL and download the HTML document. |
| Extract content | trafilatura | Parse already downloaded HTML and return readable article text and metadata. |
This separation is useful because a failed request and a failed parse are different problems.
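As a rough illustration of why the distinction matters, the sketch below labels a record by the stage that failed. It assumes plain dict records with `fetch_error` and `parse_error` keys, matching the field names used later in this notebook; the helper name `classify_failure` and the example URLs are hypothetical. It is meant for plain dicts, not for DataFrame rows, where missing values appear as NaN and would count as truthy.

```python
def classify_failure(record):
    """Return 'fetch', 'parse', or None for a plain-dict record.

    A fetch error takes priority: if the download failed, the parse step
    never had real HTML to work with, so its error is only a consequence.
    """
    if record.get("fetch_error"):
        return "fetch"
    if record.get("parse_error"):
        return "parse"
    return None


# Hypothetical records covering the three cases.
examples = [
    {"url": "https://example.com/a", "fetch_error": None, "parse_error": None},
    {"url": "https://example.com/b", "fetch_error": "HTTP 404", "parse_error": "No HTML available"},
    {"url": "https://example.com/c", "fetch_error": None, "parse_error": "No readable article text extracted"},
]

stages = [classify_failure(record) for record in examples]
print(stages)
```

A fetch failure usually points at the URL, the network, or the server, while a parse failure points at the page structure or the extractor settings.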
In [3]:
sample_url = "https://example.com/articles/dubai-property-market"
sample_html = """
<html>
<head>
<title>Dubai property prices rise again</title>
<meta name="author" content="Market Desk">
<meta property="article:published_time" content="2026-05-01">
</head>
<body>
<nav>Home | Markets | Real Estate</nav>
<article>
<h1>Dubai property prices rise again</h1>
<p>Dubai real estate prices increased during the quarter.</p>
<p>Analysts said demand for apartments and villas remained strong.</p>
<p>Higher interest rates may slow some mortgage-backed purchases.</p>
</article>
<footer>Copyright and related links</footer>
</body>
</html>
"""
In [4]:
article_text = trafilatura.extract(
    sample_html,
    include_comments=False,
    include_tables=False,
    url=sample_url,
)
print(article_text)
Dubai property prices rise again
Dubai real estate prices increased during the quarter.
Analysts said demand for apartments and villas remained strong.
Higher interest rates may slow some mortgage-backed purchases.
Dubai real estate prices increased during the quarter.
Analysts said demand for apartments and villas remained strong.
Higher interest rates may slow some mortgage-backed purchases.
In [5]:
metadata = trafilatura.extract_metadata(
    sample_html,
    default_url=sample_url,
)
# extract_metadata can return None, so every field falls back safely.
metadata_row = {
    "title": metadata.title if metadata else None,
    "author": metadata.author if metadata else None,
    "date": metadata.date if metadata else None,
    "hostname": metadata.hostname if metadata else None,
    "url": metadata.url if metadata else sample_url,
}
metadata_row
Out[5]:
{'title': 'Dubai property prices rise again',
'author': 'Market Desk',
'date': '2026-05-01',
'hostname': 'example.com',
'url': 'https://example.com/articles/dubai-property-market'}
In [6]:
def fetch_html(url):
    """Download one URL and return a dict describing the outcome."""
    headers = {
        "User-Agent": USER_AGENT,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
    try:
        response = requests.get(
            url,
            headers=headers,
            timeout=TIMEOUT_SECONDS,
        )
        # The server answered; keep the HTML only for successful status codes.
        return {
            "ok": response.ok,
            "url": url,
            "final_url": response.url,
            "status_code": response.status_code,
            "content_type": response.headers.get("content-type"),
            "html": response.text if response.ok else None,
            "error": None if response.ok else f"HTTP {response.status_code}",
        }
    except requests.RequestException as exc:
        # No usable response: timeout, DNS failure, connection error, and so on.
        return {
            "ok": False,
            "url": url,
            "final_url": None,
            "status_code": None,
            "content_type": None,
            "html": None,
            "error": str(exc),
        }
In [7]:
real_url = "https://example.com"
if RUN_LIVE_REQUESTS:
    fetch_result = fetch_html(real_url)
else:
    # Offline stand-in shaped like a successful fetch_html result.
    fetch_result = {
        "ok": True,
        "url": sample_url,
        "final_url": sample_url,
        "status_code": 200,
        "content_type": "text/html; charset=utf-8",
        "html": sample_html,
        "error": None,
    }
{k: v for k, v in fetch_result.items() if k != "html"}
Out[7]:
{'ok': True,
'url': 'https://example.com/articles/dubai-property-market',
'final_url': 'https://example.com/articles/dubai-property-market',
'status_code': 200,
'content_type': 'text/html; charset=utf-8',
'error': None}
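The `content_type` field can be checked before parsing, since a 200 response is not necessarily an HTML page (it could be a PDF or JSON, for example). The following is a minimal sketch, not part of the original workflow; the helper name `looks_like_html` and the set of accepted media types are assumptions.

```python
# Media types treated as HTML here; this set is an assumption, adjust as needed.
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def looks_like_html(content_type):
    """Return True if a Content-Type header value names an HTML media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


print(looks_like_html("text/html; charset=utf-8"))
print(looks_like_html("application/pdf"))
```

One possible use is to skip extraction, or record a separate reason, when this check fails. Some servers send a missing or wrong header, so treating the check as a hint rather than a hard rule may be safer.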
In [8]:
def extract_article(html, url):
    """Extract text and metadata from already downloaded HTML."""
    if not html:
        # Nothing to parse; report it explicitly instead of calling trafilatura.
        return {
            "article_text": None,
            "title": None,
            "author": None,
            "date": None,
            "hostname": urlparse(url).netloc,
            "parse_error": "No HTML available",
        }
    metadata = trafilatura.extract_metadata(
        html,
        default_url=url,
    )
    article_text = trafilatura.extract(
        html,
        include_comments=False,
        include_tables=False,
        url=url,
    )
    return {
        "article_text": article_text,
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        "date": metadata.date if metadata else None,
        "hostname": metadata.hostname if metadata else urlparse(url).netloc,
        "parse_error": None if article_text else "No readable article text extracted",
    }
In [9]:
parsed = extract_article(
    fetch_result["html"],
    fetch_result["final_url"] or fetch_result["url"],
)
parsed
Out[9]:
{'article_text': 'Dubai property prices rise again\nDubai real estate prices increased during the quarter.\nAnalysts said demand for apartments and villas remained strong.\nHigher interest rates may slow some mortgage-backed purchases.\nDubai real estate prices increased during the quarter.\nAnalysts said demand for apartments and villas remained strong.\nHigher interest rates may slow some mortgage-backed purchases.',
'title': 'Dubai property prices rise again',
'author': 'Market Desk',
'date': '2026-05-01',
'hostname': 'example.com',
'parse_error': None}
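In the extracted text shown above, each body paragraph appears twice. If such repeats are unwanted downstream, one option is a small post-processing step that drops exact repeated lines. This is a sketch under that assumption only: the helper name `drop_repeated_lines` is hypothetical, it does not explain or prevent the repetition, and it would also remove lines that an article repeats on purpose.

```python
def drop_repeated_lines(text):
    """Keep the first occurrence of each non-empty line, preserving order."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        key = line.strip()
        # Blank lines are always kept so paragraph spacing survives.
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


# Shortened stand-in for the extracted text above.
demo_text = "Headline\nFirst paragraph.\nSecond paragraph.\nFirst paragraph.\nSecond paragraph."
cleaned = drop_repeated_lines(demo_text)
print(cleaned)
```

Keeping the raw `article_text` alongside any cleaned version makes it possible to audit what was removed.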
In [10]:
def fetch_and_extract(url):
    """Fetch one URL, extract the article, and merge both results into one row."""
    fetch_result = fetch_html(url)
    parsed = extract_article(
        fetch_result["html"],
        fetch_result["final_url"] or url,
    )
    return {
        "url": url,
        "final_url": fetch_result["final_url"],
        "status_code": fetch_result["status_code"],
        "content_type": fetch_result["content_type"],
        "fetch_ok": fetch_result["ok"],
        "fetch_error": fetch_result["error"],
        "title": parsed["title"],
        "author": parsed["author"],
        "date": parsed["date"],
        "hostname": parsed["hostname"],
        "article_text": parsed["article_text"],
        "parse_error": parsed["parse_error"],
    }
In [11]:
urls = [
    sample_url,
    "https://example.com/articles/oil-demand-update",
    "https://example.com/articles/missing-page",
]
sample_html_by_url = {
sample_url: sample_html,
"https://example.com/articles/oil-demand-update": """
<html>
<head><title>Oil demand concerns pressure prices</title></head>
<body>
<article>
<h1>Oil demand concerns pressure prices</h1>
<p>Oil prices declined after demand concerns increased.</p>
<p>Energy traders watched central bank policy and inventory data.</p>
</article>
</body>
</html>
""",
}
In [12]:
records = []
for url in urls:
    if RUN_LIVE_REQUESTS:
        record = fetch_and_extract(url)
        time.sleep(REQUEST_DELAY_SECONDS)  # pause between live requests
    else:
        # Offline mode: a URL without sample HTML is treated as a 404.
        html = sample_html_by_url.get(url)
        fetch_result = {
            "ok": html is not None,
            "url": url,
            "final_url": url,
            "status_code": 200 if html is not None else 404,
            "content_type": "text/html; charset=utf-8" if html is not None else None,
            "html": html,
            "error": None if html is not None else "HTTP 404",
        }
        parsed = extract_article(html, url)
        record = {
            "url": url,
            "final_url": fetch_result["final_url"],
            "status_code": fetch_result["status_code"],
            "content_type": fetch_result["content_type"],
            "fetch_ok": fetch_result["ok"],
            "fetch_error": fetch_result["error"],
            "title": parsed["title"],
            "author": parsed["author"],
            "date": parsed["date"],
            "hostname": parsed["hostname"],
            "article_text": parsed["article_text"],
            "parse_error": parsed["parse_error"],
        }
    records.append(record)
df_articles = pd.DataFrame(records)
df_articles
Out[12]:
| | url | final_url | status_code | content_type | fetch_ok | fetch_error | title | author | date | hostname | article_text | parse_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://example.com/articles/dubai-property-ma... | https://example.com/articles/dubai-property-ma... | 200 | text/html; charset=utf-8 | True | NaN | Dubai property prices rise again | Market Desk | 2026-05-01 | example.com | Dubai property prices rise again\nDubai real e... | NaN |
| 1 | https://example.com/articles/oil-demand-update | https://example.com/articles/oil-demand-update | 200 | text/html; charset=utf-8 | True | NaN | Oil demand concerns pressure prices | NaN | NaN | example.com | Oil demand concerns pressure prices\nOil price... | NaN |
| 2 | https://example.com/articles/missing-page | https://example.com/articles/missing-page | 404 | NaN | False | HTTP 404 | NaN | NaN | NaN | example.com | NaN | No HTML available |
In [13]:
df_success = df_articles.dropna(subset=["article_text"]).copy()
df_success = df_success[df_success["article_text"].astype(str).str.strip() != ""]
df_success["text_length"] = df_success["article_text"].str.len()
print(f"Total URLs: {len(df_articles)}")
print(f"Successful extractions: {len(df_success)}")
print(f"Failed extractions: {len(df_articles) - len(df_success)}")
df_success[["title", "hostname", "text_length", "article_text"]]
Total URLs: 3
Successful extractions: 2
Failed extractions: 1
Out[13]:
| | title | hostname | text_length | article_text |
|---|---|---|---|---|
| 0 | Dubai property prices rise again | example.com | 396 | Dubai property prices rise again\nDubai real e... |
| 1 | Oil demand concerns pressure prices | example.com | 267 | Oil demand concerns pressure prices\nOil price... |
In [14]:
failure_columns = [
    "url",
    "status_code",
    "fetch_error",
    "parse_error",
]
# Keep any row where either the fetch step or the parse step reported a problem.
df_failures = df_articles[
    (df_articles["fetch_error"].notna()) |
    (df_articles["parse_error"].notna())
][failure_columns]
df_failures
Out[14]:
| | url | status_code | fetch_error | parse_error |
|---|---|---|---|---|
| 2 | https://example.com/articles/missing-page | 404 | HTTP 404 | No HTML available |
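The notebook creates an output directory during setup but does not show a save step. One possible way to persist the batch, sketched here with the standard library only, is to write one JSON object per line so both successes and failures are kept with their reasons. The helper names `save_records_jsonl` and `load_records_jsonl`, the file name, and the demo records are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def save_records_jsonl(records, path):
    """Write a list of dict records as JSON Lines and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-ASCII article text readable on disk.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_records_jsonl(path):
    """Read records back from a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records using a subset of the notebook's field names.
demo_records = [
    {"url": "https://example.com/a", "status_code": 200, "article_text": "Text", "parse_error": None},
    {"url": "https://example.com/b", "status_code": 404, "article_text": None, "parse_error": "No HTML available"},
]

# A temporary directory keeps the demo from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_records_jsonl(demo_records, Path(tmp_dir) / "articles.jsonl")
    round_trip = load_records_jsonl(saved_path)

print(round_trip == demo_records)
```

In this notebook the same helper could be called with `records` and a path such as `OUTPUT_DIR / "articles.jsonl"`. Writing from the plain `records` list rather than the DataFrame avoids NaN values, which are not valid JSON.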