Mohan Pradhan

Reputation: 1

Extracting titles of articles deposited in PubMed from PubMed IDs using Bio.Entrez

I am trying to extract the titles of articles deposited in PubMed from their PubMed IDs, which I have in a list called 'ids'. There are around 650K PubMed IDs. The code runs without throwing any errors, but it extracts titles for only a fraction of the articles, not all of them.

Following is the code:

from Bio import Entrez
import pandas as pd

Entrez.email = "[email protected]"

for i in range(0,len(ids),10000):

    if i%10000 == 0:   # for me to track the progress of the script
        print (i)

    idlist=ids[i:i+10000]
    handle = Entrez.efetch(db="pubmed", id=idlist, retmode="xml")

    try:
        record = Entrez.read(handle)
    except:
        continue

    title={}

    for j in range(len(record["PubmedArticle"])):

        pmid=record["PubmedArticle"][j]['MedlineCitation']['PMID'][:]
        if "Abstract" in record["PubmedArticle"][j]['MedlineCitation']['Article'].keys():
            title[pmid]=record["PubmedArticle"][j]['MedlineCitation']['Article']['ArticleTitle'].encode('ascii', 'ignore').decode('ascii')

    # save article titles
    subfile='article_titles_'+str(i)+'.txt'
    ar = pd.DataFrame.from_dict(title, orient="index")
    ar.to_csv(subfile,sep="\t",header=None)

Any suggestions will be useful. Thanks

Upvotes: 0

Views: 434

Answers (1)

cnluzon

Reputation: 1084

I cannot reproduce your example because I do not have your PubMed ID list. It would also help to know how many of the ~650K IDs you recover: getting 639K titles back (in which case some of your IDs are probably just missing from PubMed) is a very different situation from getting 10K. I tried a mini example myself and it does retrieve the titles, so I suspect some of the IDs are not valid. You could try smaller batches, and also consider the following:

  1. The bare except: continue hides any problem raised while parsing, for example an empty handle when the query result was empty. Because it skips to the next iteration, the whole batch of 10,000 IDs is dropped silently. I would catch the exception explicitly and inspect it.

  2. Throw a warning if len(record["PubmedArticle"]) is smaller than your batch size. This way you can narrow down which IDs you are missing.

  3. You only add the title to your title dict if the record has an Abstract field. Are you sure that is the case for all the records? It was present in the cases I tried, but maybe not every entry has one, and any record without it is skipped even though it has a title.
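The three points above could be combined roughly as follows. This is only a sketch, not a tested replacement for your script: the helper names (chunked, collect_titles, fetch_with_entrez) and the batch size of 200 are my own assumptions, and fetch_with_entrez assumes Biopython is installed and that Entrez.email has been set beforehand. The fetch step is passed in as a function so the bookkeeping can be checked without touching the network.

```python
import warnings


def chunked(seq, size):
    """Yield consecutive slices of seq with at most `size` items each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def fetch_with_entrez(batch):
    """Hypothetical fetcher: one efetch call per batch, parsed with Entrez.read.

    Biopython is imported lazily so the rest of the sketch runs without it.
    Set Entrez.email before calling this.
    """
    from Bio import Entrez

    handle = Entrez.efetch(db="pubmed", id=",".join(map(str, batch)), retmode="xml")
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def collect_titles(ids, fetch_batch, batch_size=200):
    """Return (titles, missing): a PMID -> title dict and the IDs not recovered."""
    titles = {}
    missing = []
    for batch in chunked(ids, batch_size):
        try:
            record = fetch_batch(batch)
        except Exception as exc:
            # Report the failure instead of silently dropping the batch.
            warnings.warn(f"batch starting at {batch[0]} failed: {exc!r}")
            missing.extend(batch)
            continue

        found = set()
        for article in record.get("PubmedArticle", []):
            citation = article["MedlineCitation"]
            pmid = str(citation["PMID"])
            # No check for an Abstract field: the title is stored regardless.
            titles[pmid] = str(citation["Article"]["ArticleTitle"])
            found.add(pmid)

        # Point 2: flag IDs that were requested but did not come back.
        not_returned = [pmid for pmid in batch if str(pmid) not in found]
        if not_returned:
            warnings.warn(f"{len(not_returned)} IDs not returned, e.g. {not_returned[:5]}")
            missing.extend(not_returned)
    return titles, missing
```

With Biopython available you would call titles, missing = collect_titles(ids, fetch_with_entrez) and then look at missing to see which IDs never came back. Only entries under "PubmedArticle" are read here, so IDs that resolve to other record types would also end up in missing.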

Upvotes: 1
