Article Fetching and Text Extraction Reference

This notebook is a practical reference for collecting readable article text from URLs. It separates the network step from the parsing step so each part can be tested and debugged independently.

Use this notebook when you need to:

  • download HTML with requests
  • inspect HTTP status codes and response metadata
  • extract readable article text with trafilatura
  • extract article metadata such as title, author, date, and hostname
  • batch process URLs into a clean pandas table
  • keep success and failure reasons visible while debugging

Install packages

Install the packages used in this notebook. The command below is commented out; uncomment it and run it once in your notebook environment.

In [ ]:
# pip install requests trafilatura pandas

Imports and configuration

Keep crawler settings near the top. This makes it easy to change timeout, user agent, output path, and whether live requests should run.

In [ ]:
import time
from pathlib import Path
from urllib.parse import urlparse

import pandas as pd
import requests
import trafilatura

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

USER_AGENT = "AcademicResearchBot/0.1 (+contact@example.com)"
TIMEOUT_SECONDS = 10
REQUEST_DELAY_SECONDS = 1.0

RUN_LIVE_REQUESTS = False
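If you would rather toggle live requests without editing the notebook, one option is to read the flag from an environment variable. The sketch below is an assumption, not part of the original configuration: the `env_flag` helper and the `RUN_LIVE_REQUESTS` variable name are hypothetical, and the accepted truthy strings are an arbitrary choice.

```python
import os


def env_flag(name, default=False):
    """Read a boolean flag from the environment (hypothetical helper).

    Returns `default` when the variable is unset. Otherwise the value is
    true only for a small set of common truthy strings.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Assumed variable name; live requests stay disabled unless it is set.
RUN_LIVE_REQUESTS = env_flag("RUN_LIVE_REQUESTS", default=False)
```

With this in place, launching the notebook with `RUN_LIVE_REQUESTS=1` in the environment would enable live requests, and leaving it unset keeps the default.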

Role separation

A stable article extraction workflow has two separate jobs.

Step            | Tool        | Job
Fetch HTML      | requests    | Connect to the URL and download the HTML document.
Extract content | trafilatura | Parse already downloaded HTML and return readable article text and metadata.

This separation is useful because a failed request and a failed parse are different problems.

Start with sample HTML

Use a small local HTML example before calling real websites. This makes the parsing step reproducible and keeps the notebook runnable without internet access.

In [3]:
sample_url = "https://example.com/articles/dubai-property-market"

sample_html = """
<html>
  <head>
    <title>Dubai property prices rise again</title>
    <meta name="author" content="Market Desk">
    <meta property="article:published_time" content="2026-05-01">
  </head>
  <body>
    <nav>Home | Markets | Real Estate</nav>
    <article>
      <h1>Dubai property prices rise again</h1>
      <p>Dubai real estate prices increased during the quarter.</p>
      <p>Analysts said demand for apartments and villas remained strong.</p>
      <p>Higher interest rates may slow some mortgage-backed purchases.</p>
    </article>
    <footer>Copyright and related links</footer>
  </body>
</html>
"""

Extract text from HTML

trafilatura.extract takes HTML text and returns the readable main content. In this pattern, Trafilatura does not fetch the URL. It only parses the HTML string you already have.

In [4]:
article_text = trafilatura.extract(
    sample_html,
    include_comments=False,
    include_tables=False,
    url=sample_url,
)

print(article_text)
Dubai property prices rise again
Dubai real estate prices increased during the quarter.
Analysts said demand for apartments and villas remained strong.
Higher interest rates may slow some mortgage-backed purchases.
Dubai real estate prices increased during the quarter.
Analysts said demand for apartments and villas remained strong.
Higher interest rates may slow some mortgage-backed purchases.
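In the output above, the three paragraphs appear twice. If repeats like this show up in your own runs, one possible post-processing step is to drop lines that exactly repeat an earlier line. The `dedupe_lines` helper below is a hypothetical sketch that uses only the standard library. It assumes exact repeats are unwanted, which may not hold for every article (for example, a refrain or a repeated quote).

```python
def dedupe_lines(text):
    """Drop lines that exactly repeat an earlier line, keeping first-seen order.

    Blank lines are always kept so paragraph spacing is not changed.
    """
    seen = set()
    kept = []
    for line in text.splitlines():
        key = line.strip()
        if key and key in seen:
            continue  # exact repeat of an earlier non-blank line
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


example_text = "\n".join([
    "Dubai property prices rise again",
    "Dubai real estate prices increased during the quarter.",
    "Analysts said demand for apartments and villas remained strong.",
    "Dubai real estate prices increased during the quarter.",
    "Analysts said demand for apartments and villas remained strong.",
])

print(dedupe_lines(example_text))
```

Applied to the extracted text, this would keep the headline and each paragraph once. It is worth comparing text lengths before and after to confirm that only true repeats were removed.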

Extract metadata from HTML

Metadata extraction is separate from text extraction. Some pages have strong metadata. Others only have partial metadata or none at all.

In [5]:
metadata = trafilatura.extract_metadata(
    sample_html,
    default_url=sample_url,
)

metadata_row = {
    "title": metadata.title if metadata else None,
    "author": metadata.author if metadata else None,
    "date": metadata.date if metadata else None,
    "hostname": metadata.hostname if metadata else None,
    "url": metadata.url if metadata else sample_url,
}

metadata_row
Out[5]:
{'title': 'Dubai property prices rise again',
 'author': 'Market Desk',
 'date': '2026-05-01',
 'hostname': 'example.com',
 'url': 'https://example.com/articles/dubai-property-market'}

Fetch HTML with requests

requests.get is the network layer. It sends an HTTP request and returns a response object.

A request can fail before Trafilatura sees anything. Common failure reasons include timeouts, blocked user agents, redirect loops, 403 and 404 responses, and server errors (5xx).

In [6]:
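def fetch_html(url):
    """Download a URL and return a dict of response fields.

    Request errors are reported in the "error" field instead of being raised.
    """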
    headers = {
        "User-Agent": USER_AGENT,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    try:
        response = requests.get(
            url,
            headers=headers,
            timeout=TIMEOUT_SECONDS,
        )

        return {
            "ok": response.ok,
            "url": url,
            "final_url": response.url,
            "status_code": response.status_code,
            "content_type": response.headers.get("content-type"),
            "html": response.text if response.ok else None,
            "error": None if response.ok else f"HTTP {response.status_code}",
        }
    except requests.RequestException as exc:
        return {
            "ok": False,
            "url": url,
            "final_url": None,
            "status_code": None,
            "content_type": None,
            "html": None,
            "error": str(exc),
        }
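The result dictionary above records `content_type` but does not act on it. A successful response can still be a PDF, an image, or JSON, none of which should go to the article parser. As a minimal sketch, the hypothetical helper `is_html_content_type` below checks the header value before parsing. The set of accepted media types is an assumption and can be widened if your sources need it.

```python
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}  # assumed allow-list


def is_html_content_type(content_type):
    """Return True when a Content-Type header value describes an HTML document.

    Parameters such as "; charset=utf-8" are ignored and the comparison is
    case-insensitive. A missing header is treated as "not HTML".
    """
    if not content_type:
        return False
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_MEDIA_TYPES


print(is_html_content_type("text/html; charset=utf-8"))
print(is_html_content_type("application/pdf"))
```

One way to use it is to skip parsing, and record a reason such as "Not an HTML response", whenever the check fails for a fetched result. Some servers omit the header, so you may prefer to treat a missing value as "unknown" rather than reject it.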

Optional live request

The notebook keeps live requests disabled by default. Set RUN_LIVE_REQUESTS = True when you want to test against a real URL.

In [7]:
real_url = "https://example.com"

if RUN_LIVE_REQUESTS:
    fetch_result = fetch_html(real_url)
else:
    fetch_result = {
        "ok": True,
        "url": sample_url,
        "final_url": sample_url,
        "status_code": 200,
        "content_type": "text/html; charset=utf-8",
        "html": sample_html,
        "error": None,
    }

{k: v for k, v in fetch_result.items() if k != "html"}
Out[7]:
{'ok': True,
 'url': 'https://example.com/articles/dubai-property-market',
 'final_url': 'https://example.com/articles/dubai-property-market',
 'status_code': 200,
 'content_type': 'text/html; charset=utf-8',
 'error': None}

Parse fetched HTML

Once HTML exists, the parsing step should not care whether the HTML came from a real request or a local test fixture.

In [8]:
def extract_article(html, url):
    """Parse an HTML string into article text and metadata; performs no network access."""
    if not html:
        return {
            "article_text": None,
            "title": None,
            "author": None,
            "date": None,
            "hostname": urlparse(url).netloc,
            "parse_error": "No HTML available",
        }

    metadata = trafilatura.extract_metadata(
        html,
        default_url=url,
    )

    article_text = trafilatura.extract(
        html,
        include_comments=False,
        include_tables=False,
        url=url,
    )

    return {
        "article_text": article_text,
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        "date": metadata.date if metadata else None,
        "hostname": metadata.hostname if metadata else urlparse(url).netloc,
        "parse_error": None if article_text else "No readable article text extracted",
    }
In [9]:
parsed = extract_article(
    fetch_result["html"],
    fetch_result["final_url"] or fetch_result["url"],
)

parsed
Out[9]:
{'article_text': 'Dubai property prices rise again\nDubai real estate prices increased during the quarter.\nAnalysts said demand for apartments and villas remained strong.\nHigher interest rates may slow some mortgage-backed purchases.\nDubai real estate prices increased during the quarter.\nAnalysts said demand for apartments and villas remained strong.\nHigher interest rates may slow some mortgage-backed purchases.',
 'title': 'Dubai property prices rise again',
 'author': 'Market Desk',
 'date': '2026-05-01',
 'hostname': 'example.com',
 'parse_error': None}

Combine fetch and parse for one URL

This wrapper keeps the full record in one dictionary. It includes both network fields and parser fields, which makes debugging easier later.

In [10]:
def fetch_and_extract(url):
    """Fetch one URL, parse the result, and return network and parser fields in one flat dict."""
    fetch_result = fetch_html(url)

    parsed = extract_article(
        fetch_result["html"],
        fetch_result["final_url"] or url,
    )

    return {
        "url": url,
        "final_url": fetch_result["final_url"],
        "status_code": fetch_result["status_code"],
        "content_type": fetch_result["content_type"],
        "fetch_ok": fetch_result["ok"],
        "fetch_error": fetch_result["error"],
        "title": parsed["title"],
        "author": parsed["author"],
        "date": parsed["date"],
        "hostname": parsed["hostname"],
        "article_text": parsed["article_text"],
        "parse_error": parsed["parse_error"],
    }

Batch process URLs

A crawler should store both successful and failed rows. Failed rows are useful because they show which URLs need retries, browser rendering, or manual review.

In [11]:
urls = [
    sample_url,
    "https://example.com/articles/oil-demand-update",
    "https://example.com/articles/missing-page",
]

sample_html_by_url = {
    sample_url: sample_html,
    "https://example.com/articles/oil-demand-update": """
    <html>
      <head><title>Oil demand concerns pressure prices</title></head>
      <body>
        <article>
          <h1>Oil demand concerns pressure prices</h1>
          <p>Oil prices declined after demand concerns increased.</p>
          <p>Energy traders watched central bank policy and inventory data.</p>
        </article>
      </body>
    </html>
    """,
}
In [12]:
records = []

for url in urls:
    if RUN_LIVE_REQUESTS:
        record = fetch_and_extract(url)
        time.sleep(REQUEST_DELAY_SECONDS)
    else:
        html = sample_html_by_url.get(url)
        fetch_result = {
            "ok": html is not None,
            "url": url,
            "final_url": url,
            "status_code": 200 if html is not None else 404,
            "content_type": "text/html; charset=utf-8" if html is not None else None,
            "html": html,
            "error": None if html is not None else "HTTP 404",
        }
        parsed = extract_article(html, url)
        record = {
            "url": url,
            "final_url": fetch_result["final_url"],
            "status_code": fetch_result["status_code"],
            "content_type": fetch_result["content_type"],
            "fetch_ok": fetch_result["ok"],
            "fetch_error": fetch_result["error"],
            "title": parsed["title"],
            "author": parsed["author"],
            "date": parsed["date"],
            "hostname": parsed["hostname"],
            "article_text": parsed["article_text"],
            "parse_error": parsed["parse_error"],
        }

    records.append(record)

df_articles = pd.DataFrame(records)
df_articles
Out[12]:
  | url | final_url | status_code | content_type | fetch_ok | fetch_error | title | author | date | hostname | article_text | parse_error
0 | https://example.com/articles/dubai-property-ma... | https://example.com/articles/dubai-property-ma... | 200 | text/html; charset=utf-8 | True | NaN | Dubai property prices rise again | Market Desk | 2026-05-01 | example.com | Dubai property prices rise again\nDubai real e... | NaN
1 | https://example.com/articles/oil-demand-update | https://example.com/articles/oil-demand-update | 200 | text/html; charset=utf-8 | True | NaN | Oil demand concerns pressure prices | NaN | NaN | example.com | Oil demand concerns pressure prices\nOil price... | NaN
2 | https://example.com/articles/missing-page | https://example.com/articles/missing-page | 404 | NaN | False | HTTP 404 | NaN | NaN | NaN | example.com | NaN | No HTML available
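Failed rows show which URLs may need a retry. As a rough sketch, the wrapper below retries any fetch function that returns a dictionary shaped like the `fetch_html` result. The names `is_retryable` and `fetch_with_retries`, the set of retryable status codes, and the exponential delay schedule are all assumptions rather than part of the original workflow. The sketch also treats every result without a status code as retryable, which covers timeouts but also permanent problems such as a malformed URL.

```python
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # assumed: rate limits and server errors


def is_retryable(result):
    """Decide whether a failed fetch result is worth another attempt."""
    if result["ok"]:
        return False
    status = result["status_code"]
    # No status code means the request failed before any response arrived.
    return status is None or status in RETRYABLE_STATUS


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch(url)` up to `max_attempts` times with exponential backoff.

    `fetch` is any callable returning a dict with "ok" and "status_code".
    `sleep` is injectable so the wrapper can be tested without waiting.
    """
    result = None
    for attempt in range(max_attempts):
        result = fetch(url)
        if not is_retryable(result):
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    return result
```

In the live branch of the batch loop, `fetch_with_retries(fetch_html, url)` could stand in for a direct `fetch_html(url)` call. A 404, like the missing page above, is not retried, because repeating the request is unlikely to change the outcome.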

Clean extracted rows

For downstream NLP work, keep rows where readable article text exists. Keep the full raw table separately so failed rows are not lost.

In [13]:
df_success = df_articles.dropna(subset=["article_text"]).copy()
df_success = df_success[df_success["article_text"].astype(str).str.strip() != ""]

df_success["text_length"] = df_success["article_text"].str.len()

print(f"Total URLs: {len(df_articles)}")
print(f"Successful extractions: {len(df_success)}")
print(f"Failed extractions: {len(df_articles) - len(df_success)}")

df_success[["title", "hostname", "text_length", "article_text"]]
Total URLs: 3
Successful extractions: 2
Failed extractions: 1
Out[13]:
  | title                               | hostname    | text_length | article_text
0 | Dubai property prices rise again    | example.com | 396         | Dubai property prices rise again\nDubai real e...
1 | Oil demand concerns pressure prices | example.com | 267         | Oil demand concerns pressure prices\nOil price...
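To keep the full raw table, including failed rows, one simple option is to write the `records` list to a JSON Lines file under `OUTPUT_DIR`. The sketch below uses only the standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical, as is the file name in the usage note.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write a list of dicts to a JSON Lines file, one record per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-ASCII article text readable on disk.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

For example, `write_jsonl(records, OUTPUT_DIR / "articles_raw.jsonl")` would store every row, successful or not. Missing values in `records` are `None`, so they are written as JSON `null`.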

Inspect failures

Failure inspection is part of the workflow. It helps you decide whether a problem is network related, parsing related, or expected.

In [14]:
failure_columns = [
    "url",
    "status_code",
    "fetch_error",
    "parse_error",
]

df_failures = df_articles[
    (df_articles["fetch_error"].notna()) |
    (df_articles["parse_error"].notna())
][failure_columns]

df_failures
Out[14]:
  | url                                       | status_code | fetch_error | parse_error
2 | https://example.com/articles/missing-page | 404         | HTTP 404    | No HTML available
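To make the network-versus-parsing distinction explicit, each record can be given a coarse failure category. The sketch below is one possible scheme, not part of the original workflow. The `classify_record` helper and its category names are hypothetical. It works on the `records` list of dictionaries, where missing values are `None`. Rows taken from the DataFrame may hold `NaN` instead, which Python treats as truthy, so convert those values first.

```python
from collections import Counter


def classify_record(record):
    """Assign a coarse outcome category to one fetch-and-parse record.

    "network": the request failed before any HTTP response arrived
    "http":    a response arrived but with an error status (for example 404)
    "parse":   HTML was downloaded but no readable article text was extracted
    "ok":      both steps succeeded
    """
    if record.get("fetch_error"):
        if record.get("status_code") is None:
            return "network"
        return "http"
    if record.get("parse_error"):
        return "parse"
    return "ok"


example_records = [
    {"status_code": 200, "fetch_error": None, "parse_error": None},
    {"status_code": 404, "fetch_error": "HTTP 404", "parse_error": "No HTML available"},
    {"status_code": None, "fetch_error": "Read timed out", "parse_error": "No HTML available"},
    {"status_code": 200, "fetch_error": None, "parse_error": "No readable article text extracted"},
]

summary = Counter(classify_record(r) for r in example_records)
print(dict(summary))
```

The fetch error is checked first because a failed fetch always produces a parse error as well ("No HTML available"), and the network problem is the root cause in that case.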