Article Fetching and Text Extraction Reference

This notebook is a practical reference for collecting readable article text from URLs. It separates the network step from the parsing step so each part can be tested and debugged independently.

Use this notebook when you need to:

- download HTML with requests
- inspect HTTP status codes and response metadata
- extract readable article text with trafilatura
- extract article metadata such as title, author, date, and hostname
- batch process URLs into a clean pandas table
- keep success and failure reasons visible while debugging

Install packages

Install the packages used in this notebook. Uncomment the command below and run it once in your notebook environment.

In [ ]:

# pip install requests trafilatura pandas

Imports and configuration

Keep crawler settings near the top. This makes it easy to change timeout, user agent, output path, and whether live requests should run.

In [ ]:

import time
from pathlib import Path
from urllib.parse import urlparse

import pandas as pd
import requests
import trafilatura

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

USER_AGENT = "AcademicResearchBot/0.1 (+contact@example.com)"
TIMEOUT_SECONDS = 10
REQUEST_DELAY_SECONDS = 1.0

RUN_LIVE_REQUESTS = False
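
One way to keep these settings together is a small frozen dataclass, so a later cell cannot change them by accident. This is a minimal sketch, not part of the notebook's own code: the `CrawlerConfig` name and its field names are hypothetical and simply mirror the constants defined above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CrawlerConfig:
    """Hypothetical container mirroring the notebook's constants."""

    user_agent: str = "AcademicResearchBot/0.1 (+contact@example.com)"
    timeout_seconds: float = 10
    request_delay_seconds: float = 1.0
    run_live_requests: bool = False
    output_dir: Path = Path("outputs")

    def headers(self):
        # Build request headers from the configured user agent.
        return {"User-Agent": self.user_agent}


config = CrawlerConfig()
print(config.timeout_seconds, config.run_live_requests)
```

Because the dataclass is frozen, assigning to `config.timeout_seconds` raises an error instead of silently changing the crawl settings.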

Role separation

A stable article extraction workflow has two separate jobs.

Step	Tool	Job
Fetch HTML	`requests`	Connect to the URL and download the HTML document.
Extract content	`trafilatura`	Parse already downloaded HTML and return readable article text and metadata.

This separation is useful because a failed request and a failed parse are different problems.

Start with sample HTML

Use a small local HTML example before calling real websites. This makes the parsing step reproducible and keeps the notebook runnable without internet access.

In [3]:

sample_url = "https://example.com/articles/dubai-property-market"

sample_html = """
<html>
  <head>
    <title>Dubai property prices rise again</title>
    <meta name="author" content="Market Desk">
    <meta property="article:published_time" content="2026-05-01">
  </head>
  <body>
    <nav>Home | Markets | Real Estate</nav>
    <article>
      <h1>Dubai property prices rise again</h1>
      <p>Dubai real estate prices increased during the quarter.</p>
      <p>Analysts said demand for apartments and villas remained strong.</p>
      <p>Higher interest rates may slow some mortgage-backed purchases.</p>
    </article>
    <footer>Copyright and related links</footer>
  </body>
</html>
"""

Extract text from HTML

trafilatura.extract takes HTML text and returns the readable main content. In this pattern, Trafilatura does not fetch the URL. It only parses the HTML string you already have.

In [4]:

article_text = trafilatura.extract(
    sample_html,
    include_comments=False,
    include_tables=False,
    url=sample_url,
)

print(article_text)

Dubai property prices rise again
Dubai real estate prices increased during the quarter.
Analysts said demand for apartments and villas remained strong.
Higher interest rates may slow some mortgage-backed purchases.
Dubai real estate prices increased during the quarter.
Analysts said demand for apartments and villas remained strong.
Higher interest rates may slow some mortgage-backed purchases.
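
The output above lists the three body paragraphs twice. If repeated lines like these are unwanted downstream, a small post-processing step can drop them. The sketch below assumes exact line-level duplicates; `dedupe_lines` is a hypothetical helper, not a trafilatura function.

```python
def dedupe_lines(text):
    """Drop exact repeated lines while keeping first-seen order.

    Hypothetical helper for illustration; not part of trafilatura.
    """
    if not text:
        return text
    seen = set()
    kept = []
    for line in text.split("\n"):
        key = line.strip()
        # Blank lines are kept as-is; only non-empty lines are de-duplicated.
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return "\n".join(kept)


raw_text = "\n".join([
    "Dubai property prices rise again",
    "Dubai real estate prices increased during the quarter.",
    "Analysts said demand for apartments and villas remained strong.",
    "Dubai real estate prices increased during the quarter.",
    "Analysts said demand for apartments and villas remained strong.",
])
clean_text = dedupe_lines(raw_text)
print(clean_text)
```

This also removes lines that repeat on purpose, such as a refrain or a repeated table label, so check a few pages before applying it to a whole batch.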

Extract metadata from HTML

Metadata extraction is separate from text extraction. Some pages have strong metadata. Others only have partial metadata or none at all.

In [5]:

metadata = trafilatura.extract_metadata(
    sample_html,
    default_url=sample_url,
)

metadata_row = {
    "title": metadata.title if metadata else None,
    "author": metadata.author if metadata else None,
    "date": metadata.date if metadata else None,
    "hostname": metadata.hostname if metadata else None,
    "url": metadata.url if metadata else sample_url,
}

metadata_row

Out[5]:

{'title': 'Dubai property prices rise again',
 'author': 'Market Desk',
 'date': '2026-05-01',
 'hostname': 'example.com',
 'url': 'https://example.com/articles/dubai-property-market'}
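
The date in the output above is a plain string. For sorting or filtering by date it may help to convert it to a `datetime.date`. A minimal sketch, assuming ISO-style `YYYY-MM-DD` values as shown above; `parse_article_date` is a hypothetical helper.

```python
from datetime import date


def parse_article_date(value):
    """Return a datetime.date for an ISO 'YYYY-MM-DD' string, else None.

    Hypothetical helper for illustration.
    """
    if not value:
        return None
    try:
        # Use only the first 10 characters so 'YYYY-MM-DDTHH:MM:SS' also works.
        return date.fromisoformat(str(value)[:10])
    except ValueError:
        return None


published = parse_article_date("2026-05-01")
missing = parse_article_date(None)
print(published, missing)
```

Returning `None` for missing or malformed values keeps pages with partial metadata in the table instead of raising an error.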

Fetch HTML with requests

requests.get is the network layer. It sends an HTTP request and returns a response object.

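The request may fail before Trafilatura sees anything. Common failure reasons include timeouts, connection errors, blocked user agents, HTTP 403 and 404 responses, and server errors (5xx). Redirects are followed automatically by requests and are not failures in themselves, so compare final_url with the requested URL when a page looks wrong.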

In [6]:

def fetch_html(url):
    headers = {
        "User-Agent": USER_AGENT,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }

    try:
        response = requests.get(
            url,
            headers=headers,
            timeout=TIMEOUT_SECONDS,
        )

        return {
            "ok": response.ok,
            "url": url,
            "final_url": response.url,
            "status_code": response.status_code,
            "content_type": response.headers.get("content-type"),
            "html": response.text if response.ok else None,
            "error": None if response.ok else f"HTTP {response.status_code}",
        }
    except requests.RequestException as exc:
        return {
            "ok": False,
            "url": url,
            "final_url": None,
            "status_code": None,
            "content_type": None,
            "html": None,
            "error": str(exc),
        }
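
Transient failures such as timeouts or 5xx responses are often worth a second attempt. The sketch below wraps any fetch function that returns a dictionary shaped like the one above. `fetch_with_retries`, its parameters, and the choice to retry only on network errors, HTTP 429, and 5xx are assumptions for illustration, not part of requests.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or attempts run out.

    Retries only when there is no status code (network error),
    or the status is 429 or 5xx. Other failures are returned at once.
    """
    result = None
    for attempt in range(1, max_attempts + 1):
        result = fetch(url)
        status = result.get("status_code")
        retryable = status is None or status == 429 or status >= 500
        if result.get("ok") or not retryable or attempt == max_attempts:
            break
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    return result


# Offline demonstration with a fake fetcher that fails twice, then succeeds.
calls = []


def fake_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        return {"ok": False, "status_code": 503, "error": "HTTP 503"}
    return {"ok": True, "status_code": 200, "error": None}


delays = []
final = fetch_with_retries(fake_fetch, "https://example.com", sleep=delays.append)
print(final["ok"], len(calls), delays)
```

In live use, pass `fetch_html` as the `fetch` argument. A 404 is returned immediately, since repeating the request is unlikely to change the answer.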

Optional live request

The notebook keeps live requests disabled by default. Set RUN_LIVE_REQUESTS = True when you want to test against a real URL.

In [7]:

real_url = "https://example.com"

if RUN_LIVE_REQUESTS:
    fetch_result = fetch_html(real_url)
else:
    fetch_result = {
        "ok": True,
        "url": sample_url,
        "final_url": sample_url,
        "status_code": 200,
        "content_type": "text/html; charset=utf-8",
        "html": sample_html,
        "error": None,
    }

{k: v for k, v in fetch_result.items() if k != "html"}

Out[7]:

{'ok': True,
 'url': 'https://example.com/articles/dubai-property-market',
 'final_url': 'https://example.com/articles/dubai-property-market',
 'status_code': 200,
 'content_type': 'text/html; charset=utf-8',
 'error': None}
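
A 200 response is not always HTML. The `content_type` field can be checked before parsing so that PDFs or images are not passed to the extractor. A minimal sketch; `is_html_content_type` is a hypothetical helper, and treating a missing header as non-HTML is a conservative assumption rather than a rule.

```python
def is_html_content_type(content_type):
    """Return True when a Content-Type header value describes HTML or XHTML.

    Hypothetical helper for illustration.
    """
    if not content_type:
        return False
    # Drop parameters such as '; charset=utf-8' and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in {"text/html", "application/xhtml+xml"}


html_ok = is_html_content_type("text/html; charset=utf-8")
pdf_ok = is_html_content_type("application/pdf")
print(html_ok, pdf_ok)
```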

Parse fetched HTML

Once HTML exists, the parsing step should not care whether the HTML came from a real request or a local test fixture.

In [8]:

def extract_article(html, url):
    if not html:
        return {
            "article_text": None,
            "title": None,
            "author": None,
            "date": None,
            "hostname": urlparse(url).netloc,
            "parse_error": "No HTML available",
        }

    metadata = trafilatura.extract_metadata(
        html,
        default_url=url,
    )

    article_text = trafilatura.extract(
        html,
        include_comments=False,
        include_tables=False,
        url=url,
    )

    return {
        "article_text": article_text,
        "title": metadata.title if metadata else None,
        "author": metadata.author if metadata else None,
        "date": metadata.date if metadata else None,
        "hostname": (metadata.hostname if metadata else None) or urlparse(url).netloc,
        "parse_error": None if article_text else "No readable article text extracted",
    }

In [9]:

parsed = extract_article(
    fetch_result["html"],
    fetch_result["final_url"] or fetch_result["url"],
)

parsed

Out[9]:

{'article_text': 'Dubai property prices rise again\nDubai real estate prices increased during the quarter.\nAnalysts said demand for apartments and villas remained strong.\nHigher interest rates may slow some mortgage-backed purchases.\nDubai real estate prices increased during the quarter.\nAnalysts said demand for apartments and villas remained strong.\nHigher interest rates may slow some mortgage-backed purchases.',
 'title': 'Dubai property prices rise again',
 'author': 'Market Desk',
 'date': '2026-05-01',
 'hostname': 'example.com',
 'parse_error': None}
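
The fallback hostname in `extract_article` comes from `urlparse(url).netloc`, which keeps the port and the original letter case. When hostnames are later used for grouping, the `hostname` attribute of the parsed URL may be a better fit. A short sketch of the difference, using a made-up URL; `hostname_from_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def hostname_from_url(url):
    """Return a lowercase hostname without port or credentials, or None."""
    return urlparse(url).hostname


parts = urlparse("https://Example.com:8443/articles/a")
netloc_value = parts.netloc      # keeps the original case and the port
hostname_value = parts.hostname  # lowercased, port removed
print(netloc_value, hostname_value)
```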

Combine fetch and parse for one URL

This wrapper keeps the full record in one dictionary. It includes both network fields and parser fields, which makes debugging easier later.

In [10]:

def fetch_and_extract(url):
    fetch_result = fetch_html(url)

    parsed = extract_article(
        fetch_result["html"],
        fetch_result["final_url"] or url,
    )

    return {
        "url": url,
        "final_url": fetch_result["final_url"],
        "status_code": fetch_result["status_code"],
        "content_type": fetch_result["content_type"],
        "fetch_ok": fetch_result["ok"],
        "fetch_error": fetch_result["error"],
        "title": parsed["title"],
        "author": parsed["author"],
        "date": parsed["date"],
        "hostname": parsed["hostname"],
        "article_text": parsed["article_text"],
        "parse_error": parsed["parse_error"],
    }
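
The same record layout is built again in the offline branch of the batch loop later in this notebook. A shared helper could keep the two copies from drifting apart. This is a sketch only; `build_record` is a hypothetical name, and its keys are copied from the dictionary returned above.

```python
def build_record(url, fetch_result, parsed):
    """Merge one fetch result and one parse result into a flat row.

    Hypothetical helper; the keys match the record built in fetch_and_extract.
    """
    return {
        "url": url,
        "final_url": fetch_result["final_url"],
        "status_code": fetch_result["status_code"],
        "content_type": fetch_result["content_type"],
        "fetch_ok": fetch_result["ok"],
        "fetch_error": fetch_result["error"],
        "title": parsed["title"],
        "author": parsed["author"],
        "date": parsed["date"],
        "hostname": parsed["hostname"],
        "article_text": parsed["article_text"],
        "parse_error": parsed["parse_error"],
    }


# Stub inputs shaped like the outputs of fetch_html and extract_article.
stub_fetch = {
    "ok": False, "url": "https://example.com/articles/missing-page",
    "final_url": "https://example.com/articles/missing-page",
    "status_code": 404, "content_type": None, "html": None, "error": "HTTP 404",
}
stub_parsed = {
    "article_text": None, "title": None, "author": None, "date": None,
    "hostname": "example.com", "parse_error": "No HTML available",
}
row = build_record(stub_fetch["url"], stub_fetch, stub_parsed)
print(row["status_code"], row["parse_error"])
```

The raw `html` field is left out of the row on purpose, as in the original wrapper, so the table stays small.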

Batch process URLs

A crawler should store both successful and failed rows. Failed rows are useful because they show which URLs need retries, browser rendering, or manual review.

In [11]:

urls = [
    sample_url,
    "https://example.com/articles/oil-demand-update",
    "https://example.com/articles/missing-page",
]

sample_html_by_url = {
    sample_url: sample_html,
    "https://example.com/articles/oil-demand-update": """
    <html>
      <head><title>Oil demand concerns pressure prices</title></head>
      <body>
        <article>
          <h1>Oil demand concerns pressure prices</h1>
          <p>Oil prices declined after demand concerns increased.</p>
          <p>Energy traders watched central bank policy and inventory data.</p>
        </article>
      </body>
    </html>
    """,
}

In [12]:

records = []

for url in urls:
    if RUN_LIVE_REQUESTS:
        record = fetch_and_extract(url)
        time.sleep(REQUEST_DELAY_SECONDS)
    else:
        html = sample_html_by_url.get(url)
        fetch_result = {
            "ok": html is not None,
            "url": url,
            "final_url": url,
            "status_code": 200 if html is not None else 404,
            "content_type": "text/html; charset=utf-8" if html is not None else None,
            "html": html,
            "error": None if html is not None else "HTTP 404",
        }
        parsed = extract_article(html, url)
        record = {
            "url": url,
            "final_url": fetch_result["final_url"],
            "status_code": fetch_result["status_code"],
            "content_type": fetch_result["content_type"],
            "fetch_ok": fetch_result["ok"],
            "fetch_error": fetch_result["error"],
            "title": parsed["title"],
            "author": parsed["author"],
            "date": parsed["date"],
            "hostname": parsed["hostname"],
            "article_text": parsed["article_text"],
            "parse_error": parsed["parse_error"],
        }

    records.append(record)

df_articles = pd.DataFrame(records)
df_articles

Out[12]:

	url	final_url	status_code	content_type	fetch_ok	fetch_error	title	author	date	hostname	article_text	parse_error
0	https://example.com/articles/dubai-property-ma...	https://example.com/articles/dubai-property-ma...	200	text/html; charset=utf-8	True	NaN	Dubai property prices rise again	Market Desk	2026-05-01	example.com	Dubai property prices rise again\nDubai real e...	NaN
1	https://example.com/articles/oil-demand-update	https://example.com/articles/oil-demand-update	200	text/html; charset=utf-8	True	NaN	Oil demand concerns pressure prices	NaN	NaN	example.com	Oil demand concerns pressure prices\nOil price...	NaN
2	https://example.com/articles/missing-page	https://example.com/articles/missing-page	404	NaN	False	HTTP 404	NaN	NaN	NaN	example.com	NaN	No HTML available
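
Before a live batch run, it can help to remove duplicate URLs so the same page is not requested twice. A minimal sketch using only the standard library; `unique_urls` is a hypothetical helper, and ignoring the `#fragment` part is an assumption that suits ordinary article pages.

```python
from urllib.parse import urldefrag


def unique_urls(urls):
    """Remove duplicate URLs, ignoring fragments, and keep first-seen order."""
    seen = {}
    for url in urls:
        # urldefrag splits off '#fragment', which is not sent to the server.
        cleaned = urldefrag(url.strip()).url
        if cleaned:
            seen.setdefault(cleaned, None)
    return list(seen)


candidate_urls = [
    "https://example.com/articles/dubai-property-market",
    "https://example.com/articles/dubai-property-market#comments",
    " https://example.com/articles/oil-demand-update ",
]
deduped = unique_urls(candidate_urls)
print(deduped)
```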

Clean extracted rows

For downstream NLP work, keep rows where readable article text exists. Keep the full raw table separately so failed rows are not lost.

In [13]:

df_success = df_articles.dropna(subset=["article_text"]).copy()
df_success = df_success[df_success["article_text"].astype(str).str.strip() != ""]

df_success["text_length"] = df_success["article_text"].str.len()

print(f"Total URLs: {len(df_articles)}")
print(f"Successful extractions: {len(df_success)}")
print(f"Failed extractions: {len(df_articles) - len(df_success)}")

df_success[["title", "hostname", "text_length", "article_text"]]

Total URLs: 3
Successful extractions: 2
Failed extractions: 1

Out[13]:

	title	hostname	text_length	article_text
0	Dubai property prices rise again	example.com	396	Dubai property prices rise again\nDubai real e...
1	Oil demand concerns pressure prices	example.com	267	Oil demand concerns pressure prices\nOil price...
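
To keep the full raw table, including failed rows, the records can be written to disk before filtering. The sketch below writes JSON Lines with the standard library and reads the file back. The helper names and the file name `articles_raw.jsonl` are hypothetical, and the demonstration uses a temporary directory where the notebook would use `OUTPUT_DIR`.

```python
import json
import tempfile
from pathlib import Path


def save_records_jsonl(records, path):
    """Write one JSON object per line so failed rows are kept with the rest."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_records_jsonl(path):
    """Read the file back into a list of dictionaries."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Demonstration with one failed row, shaped like the records built above.
demo_records = [
    {"url": "https://example.com/articles/missing-page", "status_code": 404,
     "article_text": None, "parse_error": "No HTML available"},
]
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_records_jsonl(demo_records, Path(tmp) / "articles_raw.jsonl")
    round_trip = load_records_jsonl(saved_path)

print(round_trip == demo_records)
```

Writing from the `records` list rather than from the DataFrame keeps missing values as `null`. pandas shows them as NaN, which is not valid in strict JSON.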

Inspect failures

Failure inspection is part of the workflow. It helps you decide whether a problem is network-related, parsing-related, or expected.

In [14]:

failure_columns = [
    "url",
    "status_code",
    "fetch_error",
    "parse_error",
]

df_failures = df_articles[
    (df_articles["fetch_error"].notna()) |
    (df_articles["parse_error"].notna())
][failure_columns]

df_failures

Out[14]:

	url	status_code	fetch_error	parse_error
2	https://example.com/articles/missing-page	404	HTTP 404	No HTML available
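
The failing row above carries both a fetch error and a parse error, but the parse error is only a consequence of the failed fetch. Labelling each record with the first stage that failed makes this explicit. A minimal sketch; `failure_stage` is a hypothetical helper, and the third example record is invented to show a parse-only failure.

```python
from collections import Counter


def failure_stage(record):
    """Label a record as 'ok', 'fetch', or 'parse' using its error fields.

    The fetch stage is checked first because a failed fetch always
    leaves the parser without HTML.
    """
    if record.get("fetch_error") or not record.get("fetch_ok", True):
        return "fetch"
    if record.get("parse_error"):
        return "parse"
    return "ok"


example_records = [
    {"url": "https://example.com/articles/dubai-property-market",
     "fetch_ok": True, "fetch_error": None, "parse_error": None},
    {"url": "https://example.com/articles/missing-page",
     "fetch_ok": False, "fetch_error": "HTTP 404", "parse_error": "No HTML available"},
    # Hypothetical record: the page downloaded, but no article text was found.
    {"url": "https://example.com/articles/empty-page",
     "fetch_ok": True, "fetch_error": None,
     "parse_error": "No readable article text extracted"},
]
stage_counts = Counter(failure_stage(r) for r in example_records)
print(stage_counts)
```

Apply this to the `records` list of dictionaries, not to DataFrame rows: pandas stores missing errors as NaN, which Python treats as true, so every row would be labelled as a fetch failure.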
