Citation Retriever for BibTeX

Parse a .bib file, resolve citation counts from the Semantic Scholar API, and produce a ranked summary of your bibliography.

References

  • Semantic Scholar API: https://api.semanticscholar.org/
  • bibtexparser docs: https://bibtexparser.readthedocs.io/

Requirements

  • A .bib file (e.g. exported from Zotero or Mendeley, or maintained by hand for a LaTeX project)
  • pip install bibtexparser requests pandas
In [23]:
import time
import bibtexparser
import requests
import pandas as pd

Configuration

Path to the .bib file and Semantic Scholar API endpoint. Adjust BIB_PATH to point to your bibliography.

In [17]:
BIB_PATH = "data/LIBRARY.bib"

S2_BASE   = "https://api.semanticscholar.org/graph/v1/paper"
S2_SEARCH = f"{S2_BASE}/search"
DELAY     = 0.5   # seconds between API calls to respect rate limits

print(f"BibTeX file    : {BIB_PATH}")
print(f"API base       : {S2_BASE}")
print(f"Rate-limit     : {DELAY}s between requests")
BibTeX file    : data/LIBRARY.bib
API base       : https://api.semanticscholar.org/graph/v1/paper
Rate-limit     : 0.5s between requests
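
A fixed DELAY spaces requests out, but the API may still answer with HTTP 429 (too many requests) when the shared rate limit is exhausted. The sketch below retries such responses with exponential backoff. The helper name get_with_retry and its parameters are hypothetical, not part of requests or the Semantic Scholar API; the HTTP function is passed in so the helper can be exercised without network access.

```python
import time


def get_with_retry(get, url, params, retries=3, backoff=1.0, sleep=time.sleep):
    """Call get(url, params=params), retrying on HTTP 429.

    Hypothetical helper: `get` is any callable with the shape of
    requests.get; `backoff` is the first wait in seconds and doubles
    after every rate-limited attempt.
    """
    for attempt in range(retries + 1):
        response = get(url, params=params)
        # Return anything that is not a rate-limit response, or the
        # last response once the retry budget is used up.
        if response.status_code != 429 or attempt == retries:
            return response
        sleep(backoff * 2 ** attempt)
```

In the lookup loop, a call such as requests.get(url, params=...) would become get_with_retry(requests.get, url, params). The retry count and initial backoff are untuned guesses.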

Parse the BibTeX file

Load all entries from the file. The metadata fields (title, DOI, year, citation key) are pulled out of each entry in the next step.

In [18]:
with open(BIB_PATH, "r", encoding="utf-8") as f:
    bib_db = bibtexparser.load(f)

print(f"Parsed entries : {len(bib_db.entries)}")
Parsed entries : 20
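
Fields exported by reference managers are not always clean: titles carry BibTeX braces and line breaks, and DOI fields sometimes hold a full URL or a doi: prefix rather than the bare identifier. A minimal sketch of two normalizers follows. Both function names are hypothetical, and the prefixes handled are an assumption about what typical exports contain.

```python
import re


def clean_title(raw: str) -> str:
    """Strip BibTeX braces and collapse runs of whitespace (hypothetical helper)."""
    text = raw.replace("{", "").replace("}", "")
    return re.sub(r"\s+", " ", text).strip()


def clean_doi(raw: str) -> str:
    """Reduce a DOI field to the bare identifier (hypothetical helper).

    Removes a leading https://doi.org/, http://dx.doi.org/ or doi: prefix
    and leaves the rest of the string untouched.
    """
    doi = raw.strip()
    return re.sub(
        r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", "", doi, flags=re.IGNORECASE
    )
```

Applied to an entry, this would look like clean_title(entry.get("title", "")) and clean_doi(entry.get("doi", "")). LaTeX markup inside titles (such as $^\textrm2$) is not handled here.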

Resolve citation counts

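For each entry, first try a DOI lookup. If that fails or the entry has no DOI, fall back to a title search and take the top hit. Entries that neither lookup resolves keep a citation count of None and are printed as ?. Results are collected into a list of records.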

In [19]:
records = []
total   = len(bib_db.entries)

for i, entry in enumerate(bib_db.entries):
    title = entry.get("title", "").replace("{", "").replace("}", "")
    doi   = entry.get("doi", "")
    year  = entry.get("year", "")
    key   = entry.get("ID", "")
    citation_count = None

    # DOI lookup: an exact identifier, so try it first
    if doi:
        r = requests.get(
            f"{S2_BASE}/DOI:{doi}",
            params={"fields": "citationCount"},
            timeout=10,  # fail instead of hanging on a stalled connection
        )
        if r.ok:
            citation_count = r.json().get("citationCount")

    # title search fallback: takes the top hit, which may be a different paper
    if citation_count is None:
        r = requests.get(
            S2_SEARCH,
            params={"query": title, "fields": "citationCount", "limit": 1},
            timeout=10,
        )
        if r.ok:
            data = r.json().get("data", [])
            if data:
                citation_count = data[0].get("citationCount")

    records.append({
        "key":       key,
        "title":     title,
        "year":      year,
        "doi":       doi,
        "citations": citation_count,
    })

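    # Show "?" only for unresolved entries; a genuine count of 0 prints as 0
    shown = "?" if citation_count is None else citation_count
    print(f"[{i+1}/{total}] {shown:>6}  {title[:70]}")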
    time.sleep(DELAY)
[1/20]      4  R$^\textrm2$ LLMs: Retrieval and Ranking with LLMs
[2/20]   1620  The use of MMR, diversity-based reranking for reordering documents and
[3/20]   5033  The Probabilistic Relevance Framework: BM25 and Beyond
[4/20]    700  Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abs
[5/20]     12  Review on Query-focused Multi-document Summarization (QMDS) with Compa
[6/20]      ?  RankLLM: A Python Package for Reranking with LLMs
[7/20]    565  Beyond independent relevance: methods and evaluation metrics for subto
[8/20]    478  Is ChatGPT Good at Search? Investigating Large Language Models as Re-R
[9/20]     94  Zero-Shot Listwise Document Reranking with a Large Language Model
[10/20]     73  RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Lar
[11/20]    525  Large Language Models for Information Retrieval: A Survey
[12/20]      ?  Found in the Middle: Permutation Self-Consistency Improves Listwise Ra
[13/20]      ?  Document Ranking with a Pretrained Sequence-to-Sequence Model
[14/20]   3345  Lost in the Middle: How Language Models Use Long Contexts
[15/20]      ?  Language Models (Mostly) Know What They Know
[16/20]      ?  Holistic Evaluation of Language Models
[17/20]    804  Retrieval evaluation with incomplete information
[18/20]    525  LETOR: A benchmark collection for research on learning to rank for inf
[19/20]  17179  Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[20/20]      7  Shifting from Ranking to Set Selection for Retrieval Augmented Generat
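
The title search fallback accepts whatever the API ranks first, so a loosely matching paper can contribute the wrong citation count. One possible guard is to compare the returned title with the queried one before trusting the count. The sketch below uses difflib from the standard library; the function name titles_match and the 0.9 threshold are assumptions, not values taken from the notebook.

```python
from difflib import SequenceMatcher


def titles_match(query: str, candidate: str, threshold: float = 0.9) -> bool:
    """Return True when two titles are near-identical (hypothetical helper).

    Titles are lowercased and whitespace-collapsed before comparison;
    `threshold` is the minimum SequenceMatcher ratio to accept.
    """
    def norm(text: str) -> str:
        return " ".join(text.lower().split())

    ratio = SequenceMatcher(None, norm(query), norm(candidate)).ratio()
    return ratio >= threshold
```

To use it, the search request would need "fields": "title,citationCount" so that the returned title is available, and the count would be kept only when titles_match(title, data[0].get("title", "")) is true.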

Build the results table

Convert records to a DataFrame and sort by citation count descending. Unresolved entries (NaN) sort to the end, and the summary statistics ignore them.

In [20]:
df = pd.DataFrame(records)
df["citations"] = pd.to_numeric(df["citations"], errors="coerce")
df["year"]      = pd.to_numeric(df["year"], errors="coerce")
df = df.sort_values("citations", ascending=False).reset_index(drop=True)

resolved = df["citations"].notna().sum()

print(f"Done. {resolved}/{total} papers resolved.")
print(f"Total papers: {len(df)}")
print(f"Resolved:     {resolved}")
print(f"Mean:         {df['citations'].mean():.1f}")
print(f"Median:       {df['citations'].median():.0f}")
print(f"Max:          {df['citations'].max():.0f}")
print(f"Min:          {df['citations'].min():.0f}")
Done. 15/20 papers resolved.
Total papers: 20
Resolved:     15
Mean:         2064.3
Median:       525
Max:          17179
Min:          4
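
The summary counts unresolved entries but does not name them. A small sketch for listing their BibTeX keys, so they can be checked by hand, is shown below. It works on a list of dicts shaped like records; the function name and the sample keys are hypothetical.

```python
def unresolved_keys(records):
    """Return the BibTeX keys of records whose citation count is None."""
    return [rec["key"] for rec in records if rec["citations"] is None]


# Made-up sample data in the same shape as `records`
sample = [
    {"key": "a2019", "citations": 17179},
    {"key": "b2023", "citations": None},
    {"key": "c2020", "citations": 0},  # zero citations still counts as resolved
]
print(unresolved_keys(sample))
```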

Quality flags

In [21]:
print(f"Below 10 citations: {(df['citations'] < 10).sum()}")
print(f"Below 5 citations:  {(df['citations'] < 5).sum()}")
Below 10 citations: 2
Below 5 citations:  1

Top and bottom papers

In [24]:
print("Top 10:")
print(df[["title", "year", "citations"]].head(10))

print("\nBottom 2 (resolved only):")
print(df.dropna(subset=["citations"])[["title", "year", "citations"]].tail(2))
Top 10:
                                               title  year  citations
0  Sentence-BERT: Sentence Embeddings using Siame...  2019    17179.0
1  The Probabilistic Relevance Framework: BM25 an...  2009     5033.0
2  Lost in the Middle: How Language Models Use Lo...  2024     3345.0
3  The use of MMR, diversity-based reranking for ...  1998     1620.0
4   Retrieval evaluation with incomplete information  2004      804.0
5  Multi-News: A Large-Scale Multi-Document Summa...  2019      700.0
6  Beyond independent relevance: methods and eval...  2003      565.0
7  LETOR: A benchmark collection for research on ...  2010      525.0
8  Large Language Models for Information Retrieva...  2025      525.0
9  Is ChatGPT Good at Search? Investigating Large...  2024      478.0

Bottom 2 (resolved only):
                                                title  year  citations
13  Shifting from Ranking to Set Selection for Ret...  2025        7.0
14  R$^\textrm2$ LLMs: Retrieval and Ranking with ...  2025        4.0