Citation Retriever for BibTeX
Parse a .bib file, resolve citation counts from the Semantic Scholar API, and produce a ranked summary of your bibliography.
References
- Semantic Scholar API: https://api.semanticscholar.org/
- bibtexparser docs: https://bibtexparser.readthedocs.io/
Requirements
- A .bib file (e.g. exported from Zotero, Mendeley, or LaTeX)
- pip install bibtexparser requests pandas
In [23]:
import time
import bibtexparser
import requests
import pandas as pd
Configuration
Path to the .bib file and Semantic Scholar API endpoint. Adjust BIB_PATH to point to your bibliography.
In [17]:
BIB_PATH = "data/LIBRARY.bib"
S2_BASE = "https://api.semanticscholar.org/graph/v1/paper"
S2_SEARCH = f"{S2_BASE}/search"
DELAY = 0.5 # seconds between API calls to respect rate limits
print(f"BibTeX file : {BIB_PATH}")
print(f"API base : {S2_BASE}")
print(f"Rate-limit : {DELAY}s between requests")
BibTeX file : data/LIBRARY.bib
API base : https://api.semanticscholar.org/graph/v1/paper
Rate-limit : 0.5s between requests
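A fixed delay may not be enough if the API throttles requests. The sketch below retries with exponential backoff, assuming throttling is signalled with HTTP 429. The helper name get_with_backoff and its parameters are hypothetical and not part of this notebook; fetch is any zero-argument callable returning a response-like object with a status_code attribute.

```python
import time


def get_with_backoff(fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry while the response status is 429.

    fetch: zero-argument callable returning an object with `status_code`,
           e.g. lambda: requests.get(url, params=params).
    The wait doubles after each throttled attempt: base_delay, 2x, 4x, ...
    Returns the last response, which may still be a 429 if retries run out.
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        sleep(base_delay * 2 ** attempt)
        response = fetch()
    return response
```

In the lookup loop this could wrap each call, for example get_with_backoff(lambda: requests.get(S2_SEARCH, params=params)), where params is the query dictionary for that request.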
Parse the BibTeX file
Load all entries and extract key metadata fields.
In [18]:
with open(BIB_PATH, "r", encoding="utf-8") as f:
    bib_db = bibtexparser.load(f)
print(f"Parsed entries : {len(bib_db.entries)}")
Parsed entries : 20
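Reference managers sometimes export the doi field as a full URL or with a doi: prefix, which would not fit the DOI:{doi} lookup path used later. A minimal sketch of a cleanup step, assuming those are the only variants present; normalize_doi is a hypothetical helper, not part of bibtexparser, and the DOIs in the usage note are placeholders.

```python
import re

# Prefixes that are commonly prepended to a bare DOI (assumed list, not exhaustive)
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:)", re.IGNORECASE)


def normalize_doi(raw):
    """Strip whitespace, surrounding braces, and URL/scheme prefixes from a DOI field."""
    doi = raw.strip().strip("{}").strip()
    return _DOI_PREFIX.sub("", doi)
```

For example, normalize_doi("https://doi.org/10.1000/xyz") returns "10.1000/xyz", and an empty field stays empty.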
Resolve citation counts
For each entry, first try a DOI lookup. If that fails or the entry has no DOI, fall back to a title search. Results are collected into a list of records.
In [19]:
records = []
total = len(bib_db.entries)
for i, entry in enumerate(bib_db.entries):
    # Strip the braces BibTeX uses to protect capitalization
    title = entry.get("title", "").replace("{", "").replace("}", "")
    doi = entry.get("doi", "")
    year = entry.get("year", "")
    key = entry.get("ID", "")
    citation_count = None
    # DOI lookup
    if doi:
        r = requests.get(
            f"{S2_BASE}/DOI:{doi}",
            params={"fields": "citationCount"},
        )
        if r.ok:
            citation_count = r.json().get("citationCount")
    # title search fallback (skipped when the entry has no title)
    if citation_count is None and title:
        r = requests.get(
            S2_SEARCH,
            params={"query": title, "fields": "citationCount", "limit": 1},
        )
        if r.ok:
            data = r.json().get("data", [])
            if data:
                citation_count = data[0].get("citationCount")
    records.append({
        "key": key,
        "title": title,
        "year": year,
        "doi": doi,
        "citations": citation_count,
    })
    # Show "?" only for unresolved entries; a genuine count of 0 prints as 0
    shown = "?" if citation_count is None else citation_count
    print(f"[{i+1}/{total}] {shown:>6} {title[:70]}")
    time.sleep(DELAY)
[1/20]      4 R$^\textrm2$ LLMs: Retrieval and Ranking with LLMs
[2/20]   1620 The use of MMR, diversity-based reranking for reordering documents and
[3/20]   5033 The Probabilistic Relevance Framework: BM25 and Beyond
[4/20]    700 Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abs
[5/20]     12 Review on Query-focused Multi-document Summarization (QMDS) with Compa
[6/20]      ? RankLLM: A Python Package for Reranking with LLMs
[7/20]    565 Beyond independent relevance: methods and evaluation metrics for subto
[8/20]    478 Is ChatGPT Good at Search? Investigating Large Language Models as Re-R
[9/20]     94 Zero-Shot Listwise Document Reranking with a Large Language Model
[10/20]     73 RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Lar
[11/20]    525 Large Language Models for Information Retrieval: A Survey
[12/20]      ? Found in the Middle: Permutation Self-Consistency Improves Listwise Ra
[13/20]      ? Document Ranking with a Pretrained Sequence-to-Sequence Model
[14/20]   3345 Lost in the Middle: How Language Models Use Long Contexts
[15/20]      ? Language Models (Mostly) Know What They Know
[16/20]      ? Holistic Evaluation of Language Models
[17/20]    804 Retrieval evaluation with incomplete information
[18/20]    525 LETOR: A benchmark collection for research on learning to rank for inf
[19/20]  17179 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[20/20]      7 Shifting from Ranking to Set Selection for Retrieval Augmented Generat
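The title-search fallback takes the first hit without checking that it is the same paper, so a count may belong to a different work. One way to guard against that is to compare the BibTeX title with the title returned by the API before accepting the count. The sketch below uses difflib from the standard library; the helper names and the 0.9 threshold are assumptions chosen for illustration, not values taken from this notebook.

```python
import re
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(bib_title, api_title, threshold=0.9):
    """Return True when the two normalized titles are at least `threshold` similar."""
    a, b = normalize_title(bib_title), normalize_title(api_title)
    if not a or not b:
        return False  # an empty title can never confirm a match
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

To use it, the search request would also need the title in its response (for instance "fields": "title,citationCount"), and the count would be kept only when titles_match(title, data[0].get("title", "")) is true.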
Build the results table
Convert records to a DataFrame and sort by citation count descending.
In [20]:
df = pd.DataFrame(records)
df["citations"] = pd.to_numeric(df["citations"], errors="coerce")
df["year"] = pd.to_numeric(df["year"], errors="coerce")
df = df.sort_values("citations", ascending=False).reset_index(drop=True)
resolved = df["citations"].notna().sum()
print(f"Done. {resolved}/{total} papers resolved.")
print(f"Total papers: {len(df)}")
print(f"Resolved: {resolved}")
print(f"Mean: {df['citations'].mean():.1f}")
print(f"Median: {df['citations'].median():.0f}")
print(f"Max: {df['citations'].max():.0f}")
print(f"Min: {df['citations'].min():.0f}")
Done. 15/20 papers resolved.
Total papers: 20
Resolved: 15
Mean: 2064.3
Median: 525
Max: 17179
Min: 4
Quality flags
In [21]:
print(f"Below 10 citations: {(df['citations'] < 10).sum()}")
print(f"Below 5 citations: {(df['citations'] < 5).sum()}")
Below 10 citations: 2
Below 5 citations: 1
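Comparisons against NaN evaluate to False, so the unresolved papers are not counted in either flag. A small sketch that reports them separately, written with plain Python lists so it runs without pandas; flag_counts and its default thresholds are hypothetical names for illustration.

```python
import math


def flag_counts(citations, thresholds=(10, 5)):
    """Count unresolved entries and, separately, resolved entries below each threshold.

    citations: iterable of numbers, where None or NaN marks an unresolved paper.
    """
    citations = list(citations)
    resolved = [c for c in citations if c is not None and not math.isnan(c)]
    counts = {"unresolved": len(citations) - len(resolved)}
    for t in thresholds:
        counts[f"below_{t}"] = sum(c < t for c in resolved)
    return counts
```

With the DataFrame above it could be called as flag_counts(df["citations"].tolist()).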
Top and bottom papers
In [24]:
print("Top 10:")
print(df[["title", "year", "citations"]].head(10))
# Only the two least-cited resolved papers are shown, so the label says 2
print("\nBottom 2:")
print(df.dropna(subset=["citations"])[["title", "year", "citations"]].tail(2))
Top 10:
                                               title  year  citations
0  Sentence-BERT: Sentence Embeddings using Siame...  2019    17179.0
1  The Probabilistic Relevance Framework: BM25 an...  2009     5033.0
2  Lost in the Middle: How Language Models Use Lo...  2024     3345.0
3  The use of MMR, diversity-based reranking for ...  1998     1620.0
4   Retrieval evaluation with incomplete information  2004      804.0
5  Multi-News: A Large-Scale Multi-Document Summa...  2019      700.0
6  Beyond independent relevance: methods and eval...  2003      565.0
7  LETOR: A benchmark collection for research on ...  2010      525.0
8  Large Language Models for Information Retrieva...  2025      525.0
9  Is ChatGPT Good at Search? Investigating Large...  2024      478.0

Bottom 2:
                                                title  year  citations
13  Shifting from Ranking to Set Selection for Ret...  2025        7.0
14  R$^\textrm2$ LLMs: Retrieval and Ranking with ...  2025        4.0