carl

Reputation: 4426

How to get the latest papers from pubmed

This is a fairly specific question, but somebody must have done this before. I would like to get the latest papers from PubMed: not papers about a certain subject, but all of them. My idea was to query by modification date (mdat). I use Biopython, and my code looks like this:

from Bio import Entrez

handle = Entrez.egquery(mindate='2015/01/10', maxdate='2017/02/19', datetype='mdat')
results = Entrez.read(handle)
for row in results["eGQueryResult"]:
    if row["DbName"] == "nuccore":
        print(row["Count"])

However, this returns zero papers. If I add term='cancer', I get heaps of papers. So the query seems to need the term keyword, but I want all papers, not papers on a certain subject. Any ideas how to do this? Thanks, carl

Upvotes: 1

Views: 639

Answers (2)

mitoRibo

Reputation: 4548

EDIT for Python 3. The idea is that the latest PubMed ID (PMID) corresponds to the latest paper (which I'm not sure is true). The code does a binary search for the latest PMID, then returns a list of the n most recent. It does not look at dates and only returns PMIDs.

There is an issue, however: not all PMIDs exist. For example, https://pubmed.ncbi.nlm.nih.gov/34078719/ exists, https://pubmed.ncbi.nlm.nih.gov/34078720/ does not (retraction?), and https://pubmed.ncbi.nlm.nih.gov/34078721/ exists. This breaks the binary search, because it can't tell whether it has found a PMID that hasn't been used yet or one that previously existed.

CODE:

import urllib.error
import urllib.request

def pmid_exists(pmid):
    # Return True if the PubMed page for this PMID loads without an HTTP error
    url_stem = 'https://www.ncbi.nlm.nih.gov/pubmed/'
    query = url_stem+str(pmid)
    try:
        with urllib.request.urlopen(query):
            return True
    except urllib.error.HTTPError:
        return False


def get_latest_pmid(guess = 27239557, _min_guess=None, _max_guess=None):
    #print(_min_guess,'<=',guess,'<=',_max_guess)
    
    if _min_guess and _max_guess and _max_guess-_min_guess <= 1:
        #recursive base case, this guess must be the largest PMID
        return guess
    elif pmid_exists(guess):
        #guess PMID exists, search for larger ids
        _min_guess = guess
        next_guess = (_min_guess+_max_guess)//2 if _max_guess else guess*2
    else:
        #guess PMID does not exist, search for smaller ids
        _max_guess = guess
        next_guess = (_min_guess+_max_guess)//2 if _min_guess else guess//2
        
    return get_latest_pmid(next_guess, _min_guess, _max_guess)

#Start of program

n = 5
latest_pmid = get_latest_pmid()
# Include latest_pmid itself; list() is needed because range is lazy in Python 3
most_recent_n_pmids = list(range(latest_pmid-n+1, latest_pmid+1))
print(most_recent_n_pmids)

OUTPUT:

[28245638, 28245639, 28245640, 28245641, 28245642]
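One possible way around the gap problem described above is to treat a PMID as "in use" if it or any of the next few IDs exists. This is only a sketch under an assumption I have not verified against PubMed: that runs of missing PMIDs are always shorter than `window`. The names `exists_near`, `find_latest_pmid` and `window` are made up for this example, and the existence check is passed in as a function so the logic can be tried on fake data without hitting NCBI.

```python
def exists_near(pmid, exists, window):
    """True if pmid or any of the following window-1 IDs exists."""
    return any(exists(p) for p in range(pmid, pmid + window))


def find_latest_pmid(exists, low, high, window=3):
    """Binary search for the largest existing ID, tolerating short gaps.

    Assumes exists_near(low) is True, exists_near(high) is False,
    and no run of missing IDs below the latest one reaches `window` in length.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if exists_near(mid, exists, window):
            low = mid   # something exists at or just above mid
        else:
            high = mid  # nothing exists in the window, so mid is past the end
    # low is now the largest ID whose window contains an existing ID
    return max(p for p in range(low, low + window) if exists(p))


# Fake data: IDs 1..100 are assigned, but a few are missing (gaps of 1 or 2)
fake_ids = set(range(1, 101)) - {50, 51, 73, 99}
print(find_latest_pmid(lambda p: p in fake_ids, low=1, high=1000))  # prints 100
```

With real data you would pass pmid_exists as the exists argument, with low set to a PMID known to exist and high set to a number safely beyond any assigned PMID. Each probe then costs up to `window` HTTP requests, so it is slower than the plain binary search.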

Upvotes: 3

BioGeek

Reputation: 22827

term is a required parameter, so you can't omit it in your call to Entrez.egquery.
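Since a term has to be supplied, one thing that might be worth trying is a catch-all term combined with the date limits. The sketch below only builds the E-utilities esearch URL and sends nothing. It assumes that PubMed's all[sb] filter matches every record, which is my reading of the PubMed help and not something I have tested at scale. The helper name build_pubmed_date_query is made up for this example.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_date_query(mindate, maxdate, term="all[sb]",
                            datetype="mdat", retmax=20):
    """Build an esearch URL for PubMed records within a date range.

    term="all[sb]" is assumed to act as a catch-all filter.
    """
    params = {
        "db": "pubmed",
        "term": term,
        "mindate": mindate,    # format YYYY/MM/DD
        "maxdate": maxdate,
        "datetype": datetype,  # mdat = modification date
        "retmax": retmax,      # number of PMIDs to return
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pubmed_date_query("2017/02/18", "2017/02/19")
print(url)
```

The same parameters should carry over to Biopython as Entrez.esearch(db="pubmed", term="all[sb]", mindate=..., maxdate=..., datetype="mdat"). For wide date ranges the result set will be very large, so it would need to be paged with retstart and retmax.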

If you need all the papers within a specified timeframe, you will probably need a local copy of MEDLINE and PubMed Central:

For MEDLINE, this involves getting a license. For PubMed Central, you can download the Open Access subset via FTP without a license.

Upvotes: 3
