user2535338

Reputation: 385

How to download PubMed articles and read them?

I'm having trouble saving PubMed articles and reading them back. I've seen on this page that there are some special file types, but none of them worked for me. I want to save the articles in a way that lets me keep using the keys to get the data. I don't know whether that is possible if I save them as text files. This is my code:

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

'''Class Crawler is responsible for browsing the biological databases
from DownloadArticles import DownloadArticles
c = DownloadArticles()
c.articles_dataset_list
'''
class DownloadArticles():
    def __init__(self):
        Entrez.email='[email protected]'
        self.dataC = self.saveArticlesFilesInXMLMode('pubmed', '26837606')

    '''Method 4: read the data as text.'''
    def saveArticlesFilesInXMLMode(self,dbs, ids):
        net_handle = Entrez.efetch(db=dbs, id=ids, rettype="medline", retmode="txt")
        directory = "/dataset/Pubmed/DatasetArticles/"+ ids + ".fasta"
        # if not os.path.exists(directory):
        # os.makedirs(directory)
        # filename = directory + '/'
        # if not os.path.exists(filename):
        out_handle = open(directory, "w+")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved")
        print("Parsing...")
        record = SeqIO.read(directory, "fasta")
        print(record)
        return(record.read())

I'm getting this error: ValueError: No records found in handle. Can someone please help me?


Now my code looks like this. I am trying to write one function that saves the article as .fasta, like you did, and another that reads the .fasta files, as in the answer.

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

def save_Articles_Files(dbName, idNum, rettypeName):
    net_handle = Entrez.efetch(db=dbName, id=idNum, rettype=rettypeName, retmode="txt")
    filename = path  + idNum + ".fasta"
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")

Entrez.email='[email protected]'
dbName = 'pubmed'
idNum = '26837606'
rettypeName = "medline"
path ="/run/media/Dropbox/codigos/Codes/"+dbName
save_Articles_Files(dbName, idNum, rettypeName)

But my function is not working. I need some help, please!

Upvotes: 3

Views: 992

Answers (1)

BioGeek

Reputation: 22917

You're mixing up two concepts.

1) Entrez.efetch() is used to access NCBI. In your case you are downloading an article from PubMed, and because you pass rettype="medline" the result comes back as MEDLINE-formatted text. What you get from net_handle.read() looks like this:

PMID- 26837606
OWN - NLM
STAT- In-Process
DA  - 20160203
LR  - 20160210
IS  - 2045-2322 (Electronic)
IS  - 2045-2322 (Linking)
VI  - 6
DP  - 2016 Feb 03
TI  - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
PG  - 20315
LID - 10.1038/srep20315 [doi]
AB  - Recently, CRISPR/Cas9 technology has emerged as a powerful approach for targeted 
      genome modification in eukaryotic organisms from yeast to human cell lines. Its
      successful application in several plant species promises enormous potential for
      basic and applied plant research. However, extensive studies are still needed to 
      assess this system in other important plant species, to broaden its fields of
      application and to improve methods. Here we showed that the CRISPR/Cas9 system is
      efficient in petunia (Petunia hybrid), an important ornamental plant and a model 
      for comparative research. When PDS was used as target gene, transgenic shoot
      lines with albino phenotype accounted for 55.6%-87.5% of the total regenerated T0
      Basta-resistant lines. A homozygous deletion close to 1 kb in length can be
      readily generated and identified in the first generation. A sequential
      transformation strategy--introducing Cas9 and sgRNA expression cassettes
      sequentially into petunia--can be used to make targeted mutations with short
      indels or chromosomal fragment deletions. Our results present a new plant species
      amenable to CRIPR/Cas9 technology and provide an alternative procedure for its
      exploitation.
FAU - Zhang, Bin
AU  - Zhang B
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Yang, Xia
AU  - Yang X
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Yang, Chunping
AU  - Yang C
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Li, Mingyang
AU  - Li M
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Guo, Yulong
AU  - Guo Y
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
LA  - eng
PT  - Journal Article
PT  - Research Support, Non-U.S. Gov't
DEP - 20160203
PL  - England
TA  - Sci Rep
JT  - Scientific reports
JID - 101563288
SB  - IM
PMC - PMC4738242
OID - NLM: PMC4738242
EDAT- 2016/02/04 06:00
MHDA- 2016/02/04 06:00
CRDT- 2016/02/04 06:00
PHST- 2015/09/21 [received]
PHST- 2015/12/30 [accepted]
AID - srep20315 [pii]
AID - 10.1038/srep20315 [doi]
PST - epublish
SO  - Sci Rep. 2016 Feb 3;6:20315. doi: 10.1038/srep20315.

2) SeqIO.read(), called with the "fasta" format as in your code, reads and parses a FASTA file that contains exactly one record. FASTA is a format used to store sequences, where each sequence is represented as a series of lines. The first line starts with a ">" (greater-than) symbol and holds a unique description of the sequence. The lines that follow contain the actual sequence in standard one-letter code.
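To make the difference concrete, here is a small sketch that compares a FASTA snippet with the start of a MEDLINE record. The sequence and the helper name looks_like_fasta are made up for illustration; the check is only a rough heuristic and not how SeqIO validates its input.

```python
# Hypothetical FASTA content: a ">" header line followed by sequence lines.
fasta_text = """>seq1 hypothetical example sequence
ATGGCGTACGCTTAG
GGCTAA
"""

# The first lines of the MEDLINE text returned by Entrez.efetch().
medline_text = "PMID- 26837606\nOWN - NLM\n"


def looks_like_fasta(text):
    """Return True if the first non-blank line starts with '>'."""
    for line in text.splitlines():
        if line.strip():
            return line.startswith(">")
    return False  # empty input has no header line at all


print(looks_like_fasta(fasta_text))
print(looks_like_fasta(medline_text))
```

The MEDLINE text has no ">" header line, so a FASTA parser has nothing it can treat as a record.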

As you can see, the result that you get back from Entrez.efetch() (which I pasted above) doesn't look like a FASTA file. So SeqIO.read() gives the error that it can't find any sequence records in the file.
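If the goal is to keep accessing the data by key, one option is to save the text as it is and parse it with Bio.Medline instead of SeqIO. The sketch below is a minimal example under a few assumptions: it uses a shortened copy of the record above in place of a live net_handle.read() call, and the temporary directory and .txt file name are placeholders, not paths from your setup.

```python
import os
import tempfile

from Bio import Medline

# Shortened copy of the MEDLINE text shown above. In real use this would be
# the string returned by net_handle.read().
medline_text = """PMID- 26837606
TI  - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
FAU - Zhang, Bin
AU  - Zhang B
FAU - Yang, Xia
AU  - Yang X
TA  - Sci Rep
"""

# Save under a .txt name (placeholder temp path), since the content is not FASTA.
path = os.path.join(tempfile.mkdtemp(), "26837606.txt")
with open(path, "w") as out_handle:
    out_handle.write(medline_text)

# Medline.read() parses a handle that holds a single MEDLINE record and
# returns a dictionary-like object keyed by the MEDLINE tags.
with open(path) as in_handle:
    record = Medline.read(in_handle)

print(record["PMID"])  # PubMed ID
print(record["TI"])    # title
print(record["AU"])    # list of author names
```

For a file that holds several records, Medline.parse() iterates over them one at a time in the same way.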

Upvotes: 3
