
NLP Text Embedding and Sentiment Practice Reference

This notebook is a practical reference for common NLP tasks used in article analysis workflows. It covers text cleaning, safe keyword filtering, source filtering, VADER sentiment scoring, sentence embeddings, cosine similarity, semantic search, and article text extraction.

The examples use a small in-memory article dataset so the notebook can be run without external files. Replace the sample DataFrame with your own article table when adapting this to a larger pipeline.

Notebook Scope

Use this notebook as a supporting practice and debugging reference, not as a single end-to-end production pipeline.

It is useful when you need to:

  • Clean article text before NLP processing.
  • Build keyword filters that do not break on special characters.
  • Score article sentiment with a fast rule-based baseline.
  • Convert sentences or articles into embedding vectors.
  • Compare text meaning with cosine similarity.
  • Build a small semantic search example.
  • Keep URL fetching separate from article text extraction.

Install Packages

Run this cell once in a fresh environment. The embedding model is downloaded the first time it is loaded.

In [ ]:
%pip install pandas numpy nltk sentence-transformers scikit-learn requests trafilatura

Imports and Settings

The examples use pandas for tabular data, regular expressions for keyword matching, VADER for sentiment, and SentenceTransformer for sentence-level embeddings.

In [2]:
import re

import numpy as np
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

pd.set_option("display.max_colwidth", 120)

nltk.download("vader_lexicon", quiet=True)
Out[2]:
True

Create a Small Article Dataset

The reference examples use a small article dataset with realistic columns:

  • article_text contains the article body or summary.
  • clean_source contains a normalized source name.
  • url contains the source URL.

In a real workflow, replace this cell with a CSV read, database query, API result, or crawler output.

In [3]:
articles = [
    {
        "clean_source": "Reuters",
        "url": "https://example.com/dubai-real-estate-1",
        "article_text": "Dubai real estate prices increased this quarter as apartment sales reached a new high.",
    },
    {
        "clean_source": "Bloomberg",
        "url": "https://example.com/rates-1",
        "article_text": "Interest rates remained high, but investors expect policy easing later this year.",
    },
    {
        "clean_source": "Local Sports Desk",
        "url": "https://example.com/sports-1",
        "article_text": "The football match ended in a draw after both teams missed late chances.",
    },
    {
        "clean_source": "CNBC",
        "url": "https://example.com/sp500-1",
        "article_text": "The S&P 500 closed higher as technology shares recovered from early losses.",
    },
    {
        "clean_source": "Regional Property News",
        "url": "https://example.com/dubai-real-estate-2",
        "article_text": "Property values in Dubai rose again, supported by foreign demand and new luxury projects.",
    },
    {
        "clean_source": "Unknown Feed",
        "url": "https://example.com/missing-text",
        "article_text": "",
    },
]

df = pd.DataFrame(articles)

print(f"Initial dataset size: {len(df)} rows")
df
Initial dataset size: 6 rows
Out[3]:
clean_source url article_text
0 Reuters https://example.com/dubai-real-estate-1 Dubai real estate prices increased this quarter as apartment sales reached a new high.
1 Bloomberg https://example.com/rates-1 Interest rates remained high, but investors expect policy easing later this year.
2 Local Sports Desk https://example.com/sports-1 The football match ended in a draw after both teams missed late chances.
3 CNBC https://example.com/sp500-1 The S&P 500 closed higher as technology shares recovered from early losses.
4 Regional Property News https://example.com/dubai-real-estate-2 Property values in Dubai rose again, supported by foreign demand and new luxury projects.
5 Unknown Feed https://example.com/missing-text

Clean Article Text Rows

Start by removing rows where article text is missing or empty. This keeps keyword filtering, sentiment scoring, and embedding generation stable.

In [4]:
print(f"Initial dataset size: {len(df)} rows")

df = df.dropna(subset=["article_text"])
df = df[df["article_text"].astype(str).str.strip() != ""].copy()

df["article_text"] = df["article_text"].astype(str).str.strip()

print(f"After removing missing or empty article_text: {len(df)} rows")
Initial dataset size: 6 rows
After removing missing or empty article_text: 5 rows
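The cleaning rule above can be wrapped in a small helper so the same logic is applied to any text column. This is only a sketch: the function name `drop_empty_text` and the toy frame are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd


def drop_empty_text(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy without rows whose text is missing or whitespace-only."""
    # Drop missing values first so astype(str) never turns them into "nan".
    kept = frame.dropna(subset=[column]).copy()
    kept[column] = kept[column].astype(str).str.strip()
    return kept[kept[column] != ""]


# Hypothetical sample: one real text, two blank strings, two missing values.
toy = pd.DataFrame({"article_text": ["  Rates rose.  ", "", "   ", np.nan, None]})
cleaned = drop_empty_text(toy, "article_text")
print(cleaned["article_text"].tolist())
```

Dropping missing values before the string conversion matters: converting first would turn a missing value into the literal text "nan", which would then pass the empty-string check.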

Build a Safe Keyword Regex

Use re.escape when keywords are plain text. This prevents characters such as plus, dot, parentheses, or brackets from being interpreted as regex syntax, so a keyword like S&P 500 is matched literally.

In [5]:
all_keywords = [
    "Dubai real estate",
    "property prices",
    "S&P 500",
    "interest rates",
]

keyword_pattern = re.compile(
    # A non-capturing group (?:...) avoids the pandas match-group warning.
    "(?:" + "|".join(re.escape(keyword) for keyword in all_keywords) + ")",
    re.IGNORECASE,
)

mask_keywords = df["article_text"].astype(str).str.contains(
    keyword_pattern,
    na=False,
)

df_keywords = df[mask_keywords].copy()

print(f"After keyword filtering: {len(df_keywords)} rows")
df_keywords[["clean_source", "article_text"]]
After keyword filtering: 3 rows
Out[5]:
clean_source article_text
0 Reuters Dubai real estate prices increased this quarter as apartment sales reached a new high.
1 Bloomberg Interest rates remained high, but investors expect policy easing later this year.
3 CNBC The S&P 500 closed higher as technology shares recovered from early losses.

Key points:

  • re.compile prepares the pattern once.
  • re.IGNORECASE matches upper and lower case versions.
  • The pipe character means OR in the regex pattern.
  • re.escape keeps the keyword list literal.
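To see why escaping matters, the sketch below compares escaped and unescaped keywords using only the standard library. The keywords and sentences are made-up examples, not part of the article dataset.

```python
import re

# Unescaped: an unbalanced "(" is invalid regex syntax, so compiling fails.
try:
    re.compile("revenue (USD")
    unescaped_compiles = True
except re.error:
    unescaped_compiles = False

# Unescaped: "." matches any character, so "3.5%" also matches "315%".
loose_match = re.search("3.5%", "Yields hit 315% of target") is not None

# Escaped: every character is matched literally.
safe = re.compile(re.escape("3.5%"))
strict_match = safe.search("Yields hit 315% of target") is not None
literal_match = safe.search("Rates rose to 3.5% this year") is not None

# Escaping also makes the keyword with the parenthesis usable.
paren_match = re.search(re.escape("revenue (USD"), "Total revenue (USD) rose") is not None

print(unescaped_compiles, loose_match, strict_match, literal_match, paren_match)
```

An unescaped keyword can fail in two ways: it may raise an error at compile time, or it may compile and silently match text that was never intended.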

Source Name Filtering with Regex

The same pattern logic can filter sources. This is useful when only selected English sources or trusted publishers should be kept.

In [6]:
english_sources = [
    "Reuters",
    "Bloomberg",
    "CNBC",
]

source_pattern = "|".join(re.escape(source) for source in english_sources)

mask_english = df["clean_source"].astype(str).str.contains(
    source_pattern,
    case=False,
    na=False,
    regex=True,
)

df_english = df[mask_english].copy()

df_english[["clean_source", "article_text"]]
Out[6]:
clean_source article_text
0 Reuters Dubai real estate prices increased this quarter as apartment sales reached a new high.
1 Bloomberg Interest rates remained high, but investors expect policy easing later this year.
3 CNBC The S&P 500 closed higher as technology shares recovered from early losses.
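Because str.contains performs a substring search, a source whose name merely contains an allowed name is also kept. If whole-name matching is needed, one option is to normalize the names and test membership instead. This is a sketch with a made-up source list; the names `sources`, `allowed`, and `normalized` are illustrative.

```python
import pandas as pd

sources = pd.Series(["Reuters", " reuters ", "Reuters Fan Blog", "CNBC", None])
allowed = ["Reuters", "Bloomberg", "CNBC"]

# Substring search: any value containing an allowed name is kept.
contains_mask = sources.str.contains("Reuters|Bloomberg|CNBC", case=False, na=False)

# Whole-name comparison: normalize case and whitespace, then test membership.
normalized = sources.str.strip().str.lower()
exact_mask = normalized.isin([name.lower() for name in allowed])

print(contains_mask.tolist())
print(exact_mask.tolist())
```

Here "Reuters Fan Blog" passes the substring filter but not the whole-name filter. Which behavior is right depends on how consistently the source column has been normalized upstream.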

VADER Sentiment Scoring

VADER returns negative, neutral, positive, and compound scores. The compound score is the most convenient single sentiment feature: it is a normalized sum of the word-level scores and ranges from -1 (most negative) to +1 (most positive).

In [7]:
sia = SentimentIntensityAnalyzer()

example_scores = sia.polarity_scores(
    "The market outlook is strong but uncertainty remains."
)

example_scores
Out[7]:
{'neg': 0.276, 'neu': 0.533, 'pos': 0.191, 'compound': -0.2382}
In [8]:
df["vader_compound"] = df["article_text"].astype(str).apply(
    lambda text: sia.polarity_scores(text)["compound"]
)

df[["clean_source", "article_text", "vader_compound"]].sort_values(
    "vader_compound",
    ascending=False,
)
Out[8]:
clean_source article_text vader_compound
1 Bloomberg Interest rates remained high, but investors expect policy easing later this year. 0.5423
4 Regional Property News Property values in Dubai rose again, supported by foreign demand and new luxury projects. 0.5423
0 Reuters Dubai real estate prices increased this quarter as apartment sales reached a new high. 0.3612
2 Local Sports Desk The football match ended in a draw after both teams missed late chances. -0.1027
3 CNBC The S&P 500 closed higher as technology shares recovered from early losses. -0.1280

Interpretation notes:

  • Positive compound values indicate more positive wording.
  • Negative compound values indicate more negative wording.
  • Values near zero are neutral or mixed.
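  • VADER is lexicon- and rule-based, not a large language model, so it scores wording rather than context. In the table above, the CNBC article describes a market gain yet scores slightly negative.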
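When a categorical label is more convenient than a raw score, the compound value can be bucketed with a pair of cutoffs. The sketch below assumes the conventional cutoffs of plus and minus 0.05 suggested in the VADER documentation. Treat them as a tunable starting point rather than a fixed rule. The function name `label_compound` is illustrative.

```python
def label_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the Out[8] table above.
scores = {
    "Bloomberg": 0.5423,
    "Reuters": 0.3612,
    "Local Sports Desk": -0.1027,
    "CNBC": -0.1280,
}
labels = {source: label_compound(value) for source, value in scores.items()}
print(labels)
```

A wider neutral band (for example 0.2) may suit news text better, since mildly worded factual reporting often lands close to zero.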

Load a Sentence Embedding Model

SentenceTransformer loads a pretrained model that converts text into numeric vectors. The all-MiniLM-L6-v2 model is small, fast, and commonly used for semantic similarity workflows.

The first run may take longer because the model files need to be downloaded.

In [9]:
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)

model_name
Out[9]:
'sentence-transformers/all-MiniLM-L6-v2'

Encode Sentences into Vectors

encode sends each sentence through the transformer model and returns one vector per sentence. For all-MiniLM-L6-v2, each sentence vector has 384 values.

In [10]:
sentences = [
    "Dubai real estate prices increased this quarter.",
    "Property values in Dubai rose this quarter.",
    "The football match ended in a draw.",
]

embeddings = model.encode(sentences)

print(embeddings.shape)
(3, 384)

Embedding shape interpretation:

  • Each row is one sentence.
  • Each column is one numeric dimension of the embedding.
  • Individual values are not meant to be interpreted on their own.
  • The useful signal comes from comparing whole vectors.

Cosine Similarity

Cosine similarity compares vector direction. It is useful for sentence embeddings because similar meanings should point in similar directions.

In [11]:
similarity_matrix = cosine_similarity(embeddings)

pd.DataFrame(
    similarity_matrix,
    index=["sentence_0", "sentence_1", "sentence_2"],
    columns=["sentence_0", "sentence_1", "sentence_2"],
)
Out[11]:
            sentence_0  sentence_1  sentence_2
sentence_0    1.000000     0.66181   -0.025221
sentence_1    0.661810     1.00000    0.012650
sentence_2   -0.025221     0.01265    1.000000

How to read the matrix:

  • A value near 1 means very similar direction.
  • A value near 0 means weak relation or unrelated meaning.
  • A negative value means the vectors point in broadly opposite directions; small negative values, as here, usually just indicate unrelated text.
  • The diagonal values are 1 because each sentence is compared with itself.
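Reading the matrix by eye works for three sentences but not for hundreds. A minimal sketch of finding each sentence's nearest neighbour programmatically is shown below, using the values from the matrix above. The variable names are illustrative.

```python
import numpy as np

# Values copied from the similarity matrix above.
similarity = np.array(
    [
        [1.000000, 0.661810, -0.025221],
        [0.661810, 1.000000, 0.012650],
        [-0.025221, 0.012650, 1.000000],
    ]
)

# Hide the diagonal so a sentence is never paired with itself.
off_diagonal = similarity.copy()
np.fill_diagonal(off_diagonal, -np.inf)

# For each row, the column index of its most similar other sentence.
nearest = off_diagonal.argmax(axis=1)

# The single most similar pair across the whole matrix.
best_pair = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print(nearest.tolist())
print(tuple(int(i) for i in best_pair))
```

Sentences 0 and 1 (the two Dubai property sentences) pick each other, while the football sentence has no close neighbour: its best score is only about 0.01.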

Manual Cosine Calculation

Cosine similarity is the dot product of two vectors divided by the product of their lengths. The vector length is the Euclidean norm.

In [12]:
a = embeddings[0]
b = embeddings[1]

manual_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sklearn_cosine = cosine_similarity([a], [b])[0][0]

print(round(float(manual_cosine), 6))
print(round(float(sklearn_cosine), 6))
0.66181
0.66181

Why this works:

  • np.dot(a, b) measures how much the vectors point together, scaled by their lengths.
  • np.linalg.norm(a) gives the Euclidean length of vector a.
  • Dividing by both lengths removes scale and keeps direction similarity.
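The effect of dividing by the lengths is easier to see with small hand-checkable vectors. The sketch below uses toy 2-D vectors rather than real embeddings, and the helper name `cosine` is illustrative.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([3.0, 4.0])  # Euclidean length 5
b = np.array([4.0, 3.0])  # Euclidean length 5

dot_ab = float(np.dot(a, b))           # 3*4 + 4*3 = 24
cos_ab = cosine(a, b)                  # 24 / (5 * 5)

# Scaling a vector changes the dot product but not the cosine.
dot_scaled = float(np.dot(10 * a, b))  # ten times larger
cos_scaled = cosine(10 * a, b)         # unchanged

# Perpendicular vectors have cosine 0.
cos_perp = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(dot_ab, cos_ab, dot_scaled, cos_scaled, cos_perp)
```

Note that this helper divides by zero if either vector has zero length, so a production version would need to guard against all-zero vectors.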

Semantic Search Example

Sentence embeddings suit semantic search better than word-vector arithmetic because each vector represents the meaning of a whole sentence. Encode the query as one vector and compare it with the article or sentence vectors.

In [13]:
query = "Dubai property prices are rising"

article_texts = df["article_text"].tolist()

query_embedding = model.encode([query])
article_embeddings = model.encode(article_texts)

scores = cosine_similarity(query_embedding, article_embeddings)[0]

search_results = pd.DataFrame(
    {
        "score": scores,
        "clean_source": df["clean_source"].values,
        "article_text": article_texts,
    }
).sort_values("score", ascending=False)

search_results
Out[13]:
score clean_source article_text
0 0.744305 Reuters Dubai real estate prices increased this quarter as apartment sales reached a new high.
4 0.616608 Regional Property News Property values in Dubai rose again, supported by foreign demand and new luxury projects.
3 0.230750 CNBC The S&P 500 closed higher as technology shares recovered from early losses.
1 0.210383 Bloomberg Interest rates remained high, but investors expect policy easing later this year.
2 -0.086310 Local Sports Desk The football match ended in a draw after both teams missed late chances.
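For larger collections it is common to return only the best few matches rather than sorting a full table. The sketch below shows one way to do that with NumPy on precomputed vectors. The function `top_k` and the toy 2-D vectors are illustrative stand-ins for real embeddings.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    """Return (row position, score) pairs for the k most similar documents."""
    # L2-normalize so that a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it and keep the first k positions.
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-D vectors standing in for article embeddings.
docs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.0])

results = top_k(query, docs, k=2)
print(results)
```

The returned positions refer to the row order of the encoded list, so they should be mapped back to the article table by position (for example with `df.iloc`) rather than by index label.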