Citation Retriever for BibTeX

Parse a .bib file, resolve citation counts from the Semantic Scholar API, and produce a ranked summary of your bibliography.

References

Semantic Scholar API: https://api.semanticscholar.org/
bibtexparser docs: https://bibtexparser.readthedocs.io/

Requirements

A .bib file (e.g. exported from Zotero, Mendeley, or LaTeX)
pip install bibtexparser requests pandas

In [23]:

import time
import bibtexparser
import requests
import pandas as pd

Configuration

Path to the .bib file and Semantic Scholar API endpoint. Adjust BIB_PATH to point to your bibliography.

In [17]:

BIB_PATH = "data/LIBRARY.bib"

S2_BASE   = "https://api.semanticscholar.org/graph/v1/paper"
S2_SEARCH = f"{S2_BASE}/search"
DELAY     = 0.5   # seconds between API calls to respect rate limits

print(f"BibTeX file    : {BIB_PATH}")
print(f"API base       : {S2_BASE}")
print(f"Rate-limit     : {DELAY}s between requests")

BibTeX file    : data/LIBRARY.bib
API base       : https://api.semanticscholar.org/graph/v1/paper
Rate-limit     : 0.5s between requests
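
A fixed DELAY spaces requests out, but the API may still answer with HTTP 429 (Too Many Requests) under load. The sketch below shows one possible way to retry with exponential backoff. The helper name get_with_backoff and its parameters are illustrative assumptions, not part of the notebook or of requests.

```python
import time

def get_with_backoff(fetch, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or retries run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute, e.g. lambda: requests.get(url, timeout=10).
    The helper itself is hypothetical; only the 429 check is standard HTTP.
    """
    response = fetch()
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Wait base_delay, then 2x, then 4x, ... before trying again.
        sleep(base_delay * 2 ** attempt)
        response = fetch()
    return response
```

In the lookup loop this could wrap each call, for example get_with_backoff(lambda: requests.get(url, params=params, timeout=10)). Passing sleep as a parameter keeps the helper easy to exercise without real waiting.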

Parse the BibTeX file

Load all entries and extract key metadata fields.

In [18]:

with open(BIB_PATH, "r", encoding="utf-8") as f:
    bib_db = bibtexparser.load(f)

print(f"Parsed entries : {len(bib_db.entries)}")

Parsed entries : 20
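
Exported .bib files often store the DOI as a full resolver URL or with a doi: prefix, and titles may contain line breaks as well as grouping braces. A minimal sketch of normalizing both fields before the lookup, assuming those are the only variations present; clean_title and normalize_doi are hypothetical helper names, and the DOI in the usage note is a made-up placeholder.

```python
import re

def clean_title(raw):
    """Drop BibTeX grouping braces and collapse runs of whitespace."""
    without_braces = raw.replace("{", "").replace("}", "")
    return re.sub(r"\s+", " ", without_braces).strip()

def normalize_doi(raw):
    """Strip a leading doi.org resolver URL or 'doi:' prefix, leaving the bare DOI."""
    doi = raw.strip()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", doi, flags=re.IGNORECASE)
```

For example, normalize_doi("https://doi.org/10.1000/xyz") gives "10.1000/xyz", which can then be placed after the DOI: prefix in the request path.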

Resolve citation counts

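For each entry, first try a DOI lookup. If that fails or the entry has no DOI, fall back to a title search and take the top hit. Entries that neither method resolves keep a citation count of None and are printed as ?. Results are collected into a list of records.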

In [19]:

records = []
total   = len(bib_db.entries)

for i, entry in enumerate(bib_db.entries):
    title = entry.get("title", "").replace("{", "").replace("}", "")
    doi   = entry.get("doi", "")
    year  = entry.get("year", "")
    key   = entry.get("ID", "")
    citation_count = None

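    # DOI lookup: exact match, preferred when the entry carries a DOI
    if doi:
        try:
            r = requests.get(
                f"{S2_BASE}/DOI:{doi}",
                params={"fields": "citationCount"},
                timeout=10,  # avoid hanging forever on a stalled connection
            )
            if r.ok:
                citation_count = r.json().get("citationCount")
        except requests.RequestException:
            pass  # network error: fall through to the title search

    # title search fallback: takes the top hit, which may be a different paper
    if citation_count is None and title:
        try:
            r = requests.get(
                S2_SEARCH,
                params={"query": title, "fields": "citationCount", "limit": 1},
                timeout=10,
            )
            if r.ok:
                data = r.json().get("data", [])
                if data:
                    citation_count = data[0].get("citationCount")
        except requests.RequestException:
            pass  # leave citation_count as None (unresolved)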

    records.append({
        "key":       key,
        "title":     title,
        "year":      year,
        "doi":       doi,
        "citations": citation_count,
    })

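    # Show "?" only for unresolved entries; a genuine count of 0 is printed as 0
    shown = "?" if citation_count is None else citation_count
    print(f"[{i+1}/{total}] {shown:>6}  {title[:70]}")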
    time.sleep(DELAY)

[1/20]      4  R$^\textrm2$ LLMs: Retrieval and Ranking with LLMs
[2/20]   1620  The use of MMR, diversity-based reranking for reordering documents and
[3/20]   5033  The Probabilistic Relevance Framework: BM25 and Beyond
[4/20]    700  Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abs
[5/20]     12  Review on Query-focused Multi-document Summarization (QMDS) with Compa
[6/20]      ?  RankLLM: A Python Package for Reranking with LLMs
[7/20]    565  Beyond independent relevance: methods and evaluation metrics for subto
[8/20]    478  Is ChatGPT Good at Search? Investigating Large Language Models as Re-R
[9/20]     94  Zero-Shot Listwise Document Reranking with a Large Language Model
[10/20]     73  RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Lar
[11/20]    525  Large Language Models for Information Retrieval: A Survey
[12/20]      ?  Found in the Middle: Permutation Self-Consistency Improves Listwise Ra
[13/20]      ?  Document Ranking with a Pretrained Sequence-to-Sequence Model
[14/20]   3345  Lost in the Middle: How Language Models Use Long Contexts
[15/20]      ?  Language Models (Mostly) Know What They Know
[16/20]      ?  Holistic Evaluation of Language Models
[17/20]    804  Retrieval evaluation with incomplete information
[18/20]    525  LETOR: A benchmark collection for research on learning to rank for inf
[19/20]  17179  Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[20/20]      7  Shifting from Ranking to Set Selection for Retrieval Augmented Generat

Build the results table

Convert records to a DataFrame and sort by citation count in descending order; unresolved entries (NaN) sort last.

In [20]:

df = pd.DataFrame(records)
df["citations"] = pd.to_numeric(df["citations"], errors="coerce")
df["year"]      = pd.to_numeric(df["year"], errors="coerce")
df = df.sort_values("citations", ascending=False).reset_index(drop=True)

resolved = df["citations"].notna().sum()

print(f"Done. {resolved}/{total} papers resolved.")
print(f"Total papers: {len(df)}")
print(f"Resolved:     {resolved}")
print(f"Mean:         {df['citations'].mean():.1f}")
print(f"Median:       {df['citations'].median():.0f}")
print(f"Max:          {df['citations'].max():.0f}")
print(f"Min:          {df['citations'].min():.0f}")

Done. 15/20 papers resolved.
Total papers: 20
Resolved:     15
Mean:         2064.3
Median:       525
Max:          17179
Min:          4

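Quality flags

Count the resolved papers with fewer than 10 and fewer than 5 citations. Unresolved papers are not counted, because a comparison with NaN evaluates to False.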

In [21]:

print(f"Below 10 citations: {(df['citations'] < 10).sum()}")
print(f"Below 5 citations:  {(df['citations'] < 5).sum()}")

Below 10 citations: 2
Below 5 citations:  1
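
The counts above do not say which entries were flagged. A small sketch of listing them by BibTeX key, working directly on the records list built earlier; flag_records is a hypothetical helper, and the threshold of 10 simply mirrors the first flag above.

```python
def flag_records(records, threshold=10):
    """Return (unresolved_keys, low_citation_keys) for a list of record dicts.

    Each record is expected to have a "key" and a "citations" entry, where
    citations is None for papers that could not be resolved.
    """
    unresolved = [r["key"] for r in records if r["citations"] is None]
    low = [
        r["key"]
        for r in records
        if r["citations"] is not None and r["citations"] < threshold
    ]
    return unresolved, low
```

Calling unresolved, low = flag_records(records) gives two lists of keys that can be checked by hand in the .bib file.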

Top and bottom papers

In [24]:

print("Top 10:")
print(df[["title", "year", "citations"]].head(10))

print("\nBottom 2:")
# Drop only rows without a citation count, then show the two least-cited papers
print(df.dropna(subset=["citations"])[["title", "year", "citations"]].tail(2))

Top 10:
                                               title  year  citations
0  Sentence-BERT: Sentence Embeddings using Siame...  2019    17179.0
1  The Probabilistic Relevance Framework: BM25 an...  2009     5033.0
2  Lost in the Middle: How Language Models Use Lo...  2024     3345.0
3  The use of MMR, diversity-based reranking for ...  1998     1620.0
4   Retrieval evaluation with incomplete information  2004      804.0
5  Multi-News: A Large-Scale Multi-Document Summa...  2019      700.0
6  Beyond independent relevance: methods and eval...  2003      565.0
7  LETOR: A benchmark collection for research on ...  2010      525.0
8  Large Language Models for Information Retrieva...  2025      525.0
9  Is ChatGPT Good at Search? Investigating Large...  2024      478.0

Bottom 2:
                                                title  year  citations
13  Shifting from Ranking to Set Selection for Ret...  2025        7.0
14  R$^\textrm2$ LLMs: Retrieval and Ranking with ...  2025        4.0
