NLP Text Embedding and Sentiment Practice Reference
This notebook is a practical reference for common NLP tasks used in article analysis workflows. It covers text cleaning, safe keyword filtering, source filtering, VADER sentiment scoring, sentence embeddings, cosine similarity, semantic search, and article text extraction.
The examples use a small in-memory article dataset so the notebook can be run without external files. Replace the sample DataFrame with your own article table when adapting this to a larger pipeline.
Notebook Scope
Use this notebook as a supporting practice and debugging reference, not as a single end-to-end production pipeline.
It is useful when you need to:
- Clean article text before NLP processing.
- Build keyword filters that do not break on special characters.
- Score article sentiment with a fast rule-based baseline.
- Convert sentences or articles into embedding vectors.
- Compare text meaning with cosine similarity.
- Build a small semantic search example.
- Keep URL fetching separate from article text extraction.
Install Packages
Run this cell once in a fresh environment. The embedding model is downloaded the first time it is loaded.
%pip install pandas numpy nltk sentence-transformers scikit-learn requests trafilatura
Imports and Settings
The examples use pandas for tabular data, regular expressions for keyword matching, VADER for sentiment, and SentenceTransformer for sentence-level embeddings.
import re
import numpy as np
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
pd.set_option("display.max_colwidth", 120)
nltk.download("vader_lexicon", quiet=True)
True
Create a Small Article Dataset
The reference examples use a small article dataset with realistic columns:
- article_text contains the article body or summary.
- clean_source contains a normalized source name.
- url contains the source URL.
In a real workflow, replace this cell with a CSV read, database query, API result, or crawler output.
articles = [
    {
        "clean_source": "Reuters",
        "url": "https://example.com/dubai-real-estate-1",
        "article_text": "Dubai real estate prices increased this quarter as apartment sales reached a new high.",
    },
    {
        "clean_source": "Bloomberg",
        "url": "https://example.com/rates-1",
        "article_text": "Interest rates remained high, but investors expect policy easing later this year.",
    },
    {
        "clean_source": "Local Sports Desk",
        "url": "https://example.com/sports-1",
        "article_text": "The football match ended in a draw after both teams missed late chances.",
    },
    {
        "clean_source": "CNBC",
        "url": "https://example.com/sp500-1",
        "article_text": "The S&P 500 closed higher as technology shares recovered from early losses.",
    },
    {
        "clean_source": "Regional Property News",
        "url": "https://example.com/dubai-real-estate-2",
        "article_text": "Property values in Dubai rose again, supported by foreign demand and new luxury projects.",
    },
    {
        "clean_source": "Unknown Feed",
        "url": "https://example.com/missing-text",
        "article_text": "",
    },
]
df = pd.DataFrame(articles)
print(f"Initial dataset size: {len(df)} rows")
df
Initial dataset size: 6 rows
| | clean_source | url | article_text |
|---|---|---|---|
| 0 | Reuters | https://example.com/dubai-real-estate-1 | Dubai real estate prices increased this quarter as apartment sales reached a new high. |
| 1 | Bloomberg | https://example.com/rates-1 | Interest rates remained high, but investors expect policy easing later this year. |
| 2 | Local Sports Desk | https://example.com/sports-1 | The football match ended in a draw after both teams missed late chances. |
| 3 | CNBC | https://example.com/sp500-1 | The S&P 500 closed higher as technology shares recovered from early losses. |
| 4 | Regional Property News | https://example.com/dubai-real-estate-2 | Property values in Dubai rose again, supported by foreign demand and new luxury projects. |
| 5 | Unknown Feed | https://example.com/missing-text | |
Clean Article Text Rows
Start by removing rows where article text is missing or empty. This keeps keyword filtering, sentiment scoring, and embedding generation stable.
print(f"Initial dataset size: {len(df)} rows")
df = df.dropna(subset=["article_text"])
df = df[df["article_text"].astype(str).str.strip() != ""].copy()
df["article_text"] = df["article_text"].astype(str).str.strip()
print(f"After removing missing or empty article_text: {len(df)} rows")
Initial dataset size: 6 rows
After removing missing or empty article_text: 5 rows
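The same cleaning steps can be wrapped in a reusable helper. The sketch below is one possible shape, not part of the original notebook: the function name clean_text_column and the extra whitespace-collapsing step are assumptions added for illustration.

```python
import pandas as pd


def clean_text_column(frame: pd.DataFrame, column: str = "article_text") -> pd.DataFrame:
    """Drop rows whose text is missing or blank and normalize whitespace.

    Hypothetical helper: the name and the whitespace-collapsing step are
    illustrative assumptions, not part of the original notebook.
    """
    out = frame.dropna(subset=[column]).copy()
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces.
    out[column] = (
        out[column].astype(str).str.replace(r"\s+", " ", regex=True).str.strip()
    )
    # Remove rows that are empty once whitespace is stripped.
    return out[out[column] != ""]


sample = pd.DataFrame(
    {"article_text": ["  Prices   rose\nagain. ", None, "   ", "Rates held."]}
)
cleaned = clean_text_column(sample)
print(cleaned["article_text"].tolist())
```

The helper keeps the original index labels, as the cell above does, so cleaned rows can still be traced back to their source rows.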
Build a Safe Keyword Regex
Use re.escape when keywords are plain text. This prevents characters such as +, ., or brackets from being interpreted as regex syntax, so a keyword such as S&P 500 is matched literally.
all_keywords = [
    "Dubai real estate",
    "property prices",
    "S&P 500",
    "interest rates",
]
# No capturing group is needed for a yes/no match; omitting it also avoids
# the pandas "match groups" UserWarning from str.contains.
keyword_pattern = re.compile(
    "|".join(re.escape(keyword) for keyword in all_keywords),
    re.IGNORECASE,
)
mask_keywords = df["article_text"].astype(str).str.contains(
    keyword_pattern,
    na=False,
)
df_keywords = df[mask_keywords].copy()
print(f"After keyword filtering: {len(df_keywords)} rows")
df_keywords[["clean_source", "article_text"]]
After keyword filtering: 3 rows
| | clean_source | article_text |
|---|---|---|
| 0 | Reuters | Dubai real estate prices increased this quarter as apartment sales reached a new high. |
| 1 | Bloomberg | Interest rates remained high, but investors expect policy easing later this year. |
| 3 | CNBC | The S&P 500 closed higher as technology shares recovered from early losses. |
Key points:
- re.compile prepares the pattern once.
- re.IGNORECASE matches upper and lower case versions.
- The pipe character means OR in the regex pattern.
- re.escape keeps the keyword list literal.
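To make the effect of re.escape concrete, the following sketch compares escaped and unescaped patterns. The keyword "price (USD" and the sample strings are made up for illustration and do not come from the article dataset.

```python
import re

# Unescaped, an unbalanced "(" is invalid regex syntax and fails to compile.
try:
    re.compile("price (USD")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Unescaped, "." matches any character, so "3.5" also matches "305".
loose_match = re.search("3.5", "rates near 305 points") is not None
strict_match = re.search(re.escape("3.5"), "rates near 305 points") is not None

# Escaped keywords match only their literal text (case-insensitively here).
keywords = ["price (USD", "S&P 500"]  # illustrative keywords
pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
hits = pattern.findall("The s&p 500 rose while the price (usd terms) fell.")

print(unescaped_ok, loose_match, strict_match, hits)
```

Escaping matters most when the keyword list comes from users or configuration files, where special characters cannot be ruled out in advance.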
Source Name Filtering with Regex
The same pattern logic can filter sources. This is useful when only selected English sources or trusted publishers should be kept.
english_sources = [
    "Reuters",
    "Bloomberg",
    "CNBC",
]
source_pattern = "|".join(re.escape(source) for source in english_sources)
mask_english = df["clean_source"].astype(str).str.contains(
    source_pattern,
    case=False,
    na=False,
    regex=True,
)
df_english = df[mask_english].copy()
df_english[["clean_source", "article_text"]]
| | clean_source | article_text |
|---|---|---|
| 0 | Reuters | Dubai real estate prices increased this quarter as apartment sales reached a new high. |
| 1 | Bloomberg | Interest rates remained high, but investors expect policy easing later this year. |
| 3 | CNBC | The S&P 500 closed higher as technology shares recovered from early losses. |
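Because str.contains performs a substring match, a source such as "Reuters Fan Blog" would also pass the filter above. When only exact source names should be kept, one alternative is to normalize the names and use isin. This is a sketch under that assumption; the sample source names are invented for illustration.

```python
import pandas as pd

# Illustrative source names, including a near-miss and a missing value.
sources = pd.Series(["Reuters", "reuters ", "Reuters Fan Blog", "CNBC", None])
allowed = {"reuters", "bloomberg", "cnbc"}

# Substring matching accepts any name that merely contains an allowed source.
mask_substring = sources.str.contains("Reuters", case=False, na=False)

# Exact matching: normalize case and whitespace, then require the whole name.
normalized = sources.fillna("").astype(str).str.strip().str.lower()
mask_exact = normalized.isin(allowed)

print(mask_substring.tolist())
print(mask_exact.tolist())
```

Substring matching is more forgiving of name variants, while exact matching is stricter about what counts as a trusted publisher; which one fits depends on how well clean_source has been normalized upstream.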
VADER Sentiment Scoring
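VADER returns negative, neutral, positive, and compound scores. The neg, neu, and pos values are proportions of the text and sum to roughly 1. The compound score is a single normalized value between -1 (most negative) and +1 (most positive), which makes it the easiest single sentiment feature.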
sia = SentimentIntensityAnalyzer()
example_scores = sia.polarity_scores(
    "The market outlook is strong but uncertainty remains."
)
example_scores
{'neg': 0.276, 'neu': 0.533, 'pos': 0.191, 'compound': -0.2382}
df["vader_compound"] = df["article_text"].astype(str).apply(
    lambda text: sia.polarity_scores(text)["compound"]
)
df[["clean_source", "article_text", "vader_compound"]].sort_values(
    "vader_compound",
    ascending=False,
)
| | clean_source | article_text | vader_compound |
|---|---|---|---|
| 1 | Bloomberg | Interest rates remained high, but investors expect policy easing later this year. | 0.5423 |
| 4 | Regional Property News | Property values in Dubai rose again, supported by foreign demand and new luxury projects. | 0.5423 |
| 0 | Reuters | Dubai real estate prices increased this quarter as apartment sales reached a new high. | 0.3612 |
| 2 | Local Sports Desk | The football match ended in a draw after both teams missed late chances. | -0.1027 |
| 3 | CNBC | The S&P 500 closed higher as technology shares recovered from early losses. | -0.1280 |
Interpretation notes:
- Positive compound values indicate more positive wording.
- Negative compound values indicate more negative wording.
- Values near zero are neutral or mixed.
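- VADER is rule-based, not a large language model, so it scores words and simple patterns rather than overall context. In the table above, the CNBC article reports a market gain but scores slightly negative, likely because of the word "losses".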
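The interpretation notes can be turned into a simple labeling rule. The sketch below uses cutoffs of +0.05 and -0.05, which are the values commonly suggested in the VADER documentation; treat them as an assumption to tune for your data. The function name label_compound is hypothetical.

```python
def label_compound(score: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label.

    The 0.05 default is the commonly suggested cutoff, used here as an
    assumption; adjust it to suit your own articles.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the table above, in the same order.
compound_scores = [0.5423, 0.5423, 0.3612, -0.1027, -0.1280]
labels = [label_compound(score) for score in compound_scores]
print(labels)
```

On a DataFrame, the same function could be applied with df["vader_compound"].apply(label_compound).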
Load a Sentence Embedding Model
SentenceTransformer loads a pretrained model that converts text into numeric vectors. The all-MiniLM-L6-v2 model is small, fast, and commonly used for semantic similarity workflows.
The first run may take longer because the model files need to be downloaded.
model_name = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
model_name
'sentence-transformers/all-MiniLM-L6-v2'
Encode Sentences into Vectors
encode sends each sentence through the transformer model and returns one vector per sentence. For all-MiniLM-L6-v2, each sentence vector has 384 values.
sentences = [
    "Dubai real estate prices increased this quarter.",
    "Property values in Dubai rose this quarter.",
    "The football match ended in a draw.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
(3, 384)
Embedding shape interpretation:
- Each row is one sentence.
- Each column is one numeric dimension of the embedding.
- Individual values have no standalone meaning and are not interpreted manually.
- The useful signal comes from comparing whole vectors.
Cosine Similarity
Cosine similarity compares vector direction. It is useful for sentence embeddings because similar meanings should point in similar directions.
similarity_matrix = cosine_similarity(embeddings)
pd.DataFrame(
    similarity_matrix,
    index=["sentence_0", "sentence_1", "sentence_2"],
    columns=["sentence_0", "sentence_1", "sentence_2"],
)
| | sentence_0 | sentence_1 | sentence_2 |
|---|---|---|---|
| sentence_0 | 1.000000 | 0.66181 | -0.025221 |
| sentence_1 | 0.661810 | 1.00000 | 0.012650 |
| sentence_2 | -0.025221 | 0.01265 | 1.000000 |
How to read the matrix:
- A value near 1 means very similar direction.
- A value near 0 means weak relation or unrelated meaning.
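- A negative value means the vectors point in opposing directions (the angle between them is greater than 90 degrees). Small negative values such as those here still indicate unrelated text.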
- The diagonal values are 1 because each sentence is compared with itself.
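With more than a handful of sentences, reading the matrix by eye becomes impractical. As a small sketch, the most similar pair of distinct sentences can be found by masking the diagonal and taking the largest remaining value. The matrix below is copied from the output above so the example runs without the embedding model.

```python
import numpy as np

# Similarity values copied from the matrix above.
sim = np.array(
    [
        [1.000000, 0.661810, -0.025221],
        [0.661810, 1.000000, 0.012650],
        [-0.025221, 0.012650, 1.000000],
    ]
)

# Ignore self-comparisons by setting the diagonal to negative infinity.
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)

# Locate the largest remaining value and convert it to (row, column).
row, col = np.unravel_index(np.argmax(masked), masked.shape)
best_pair = (int(row), int(col))
best_score = float(masked[row, col])
print(best_pair, best_score)
```

Here the pair (0, 1) corresponds to the two Dubai property sentences, which matches the reading of the matrix above.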
Manual Cosine Calculation
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The vector length is the Euclidean norm.
a = embeddings[0]
b = embeddings[1]
manual_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
sklearn_cosine = cosine_similarity([a], [b])[0][0]
print(round(float(manual_cosine), 6))
print(round(float(sklearn_cosine), 6))
0.66181
0.66181
Why this works:
- np.dot(a, b) measures how much the vectors point together, scaled by their lengths.
- np.linalg.norm(a) gives the Euclidean length of vector a.
- Dividing by both lengths removes scale and keeps direction similarity.
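A few toy vectors make the scale-invariance point easier to check by hand. The sketch below uses made-up low-dimensional vectors rather than real embeddings, and also shows that once vectors are scaled to unit length, a plain dot product gives the same result as cosine similarity.

```python
import numpy as np


def cosine(a, b) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 1.0, 0.5])

same_direction = cosine(a, 10 * a)          # scaling does not change direction
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
opposite = cosine(a, -a)                     # exactly opposite direction

# After L2 normalization, the dot product alone equals the cosine similarity.
unit_a = a / np.linalg.norm(a)
unit_b = b / np.linalg.norm(b)
dot_of_units = float(np.dot(unit_a, unit_b))

print(same_direction, orthogonal, opposite)
print(dot_of_units, cosine(a, b))
```

This is why some pipelines store normalized embeddings: similarity search then reduces to a matrix of dot products.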
Semantic Search Example
Sentence embeddings are better suited to semantic search than word-vector arithmetic because each vector represents a whole sentence or article. Encode the query once, then compare that vector with the article or sentence vectors.
query = "Dubai property prices are rising"
article_texts = df["article_text"].tolist()
query_embedding = model.encode([query])
article_embeddings = model.encode(article_texts)
scores = cosine_similarity(query_embedding, article_embeddings)[0]
search_results = pd.DataFrame(
    {
        "score": scores,
        "clean_source": df["clean_source"].values,
        "article_text": article_texts,
    }
).sort_values("score", ascending=False)
search_results
| | score | clean_source | article_text |
|---|---|---|---|
| 0 | 0.744305 | Reuters | Dubai real estate prices increased this quarter as apartment sales reached a new high. |
| 4 | 0.616608 | Regional Property News | Property values in Dubai rose again, supported by foreign demand and new luxury projects. |
| 3 | 0.230750 | CNBC | The S&P 500 closed higher as technology shares recovered from early losses. |
| 1 | 0.210383 | Bloomberg | Interest rates remained high, but investors expect policy easing later this year. |
| 2 | -0.086310 | Local Sports Desk | The football match ended in a draw after both teams missed late chances. |
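In practice a search usually returns only the best few matches and drops weak ones. The sketch below adds a top-k limit and a minimum score to the ranking step. The function name top_k and the 0.3 cutoff are illustrative assumptions, and the scores are copied from the results above (in original row order) so the example runs without the embedding model.

```python
import numpy as np


def top_k(scores, k: int = 3, min_score: float = 0.3):
    """Return (position, score) pairs for the k best scores above min_score.

    Hypothetical helper: the default k and min_score are illustrative
    values, not recommendations.
    """
    scores = np.asarray(scores, dtype=float)
    # Sort positions by descending score and keep the first k.
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= min_score]


# Scores from the table above, ordered by row position 0 to 4.
scores = [0.744305, 0.210383, -0.086310, 0.230750, 0.616608]
results = top_k(scores, k=3, min_score=0.3)
print(results)
```

With these values only the two Dubai property articles pass the cutoff. A suitable threshold depends on the model and the corpus, so it is worth checking against a few known queries before fixing it.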