Parthasarathi Singh

Reputation: 3

Using BioPython, How To Print DOI References In A Single Line (Comma-Delimited) For A Given Pair of Search Terms, Instead Of In Multiple Lines?

People of StackOverflow, first of all, thanks for your patience. I understand this is my third thread on the subject, but I'm getting nowhere and don't even know where to start (I don't know what I don't know), so I thought I'd ask here anyway. I'm trying to pull references from PMC using Biopython and write them back into a CSV file that contains, among other things, the plant name, the disease or condition it treats (or its medicinal action), and the DOI URLs of the articles that mention that plant-disease pair. After many hours of trying to work out what to do, and discussing the code with people much more experienced than myself, this is what I ended up with in Visual Studio Code:

  for plant, disease in plant_disease_list:
    search_query = generate_search_query(plant, disease)
    handle1 = Entrez.esearch(db="pmc", term=search_query, retmax="10")
    record1 = Entrez.read(handle1)
    pubmed_ids = record1.get("IdList")
    if len(pubmed_ids)==0:
      print("{}, {}, None".format(plant, disease))
    else:
      for pubmed_id in pubmed_ids:
        handle2 = Entrez.esummary(db="pmc", id=pubmed_id)
        records = Entrez.read(handle2)
        for record in records:
          doi = record.get("DOI")
          if doi is None:
           print(("{}, {}".format(plant, disease)))
          else:
            doi_main = doi.split()
            string = "http://doi.org/"
            to_add = (",").join((string + x) for x in doi_main)
            print("{}, {},".format(plant, disease), to_add, sep="")

where generate_search_query was previously defined as:

def generate_search_query(plant, disease):
  search_query = '"{}" AND "{}"'.format(plant, disease)
  return search_query

This is the output I'm getting:

Asystasia salicifalia, Puerperal illness, None
Asystasia salicifalia, Puerperium, None
Asystasia salicifalia, Puerperal disorder, None
Barleria strigosa, Tonic
Justicia procumbens, Lumbago, None
Justicia procumbens, Itching,http://doi.org/10.1673/031.012.0501
Strobilanthes auriculata, Malnutrition, None
Thunbergia laurifolia, Detoxificant, None
Thunbergia similis, Tonic, None
Lannea coromandelica, Dizziness,http://doi.org/10.3897/phytokeys.102.24380
Lannea coromandelica, Dizziness,http://doi.org/10.1186/s13002-016-0089-8
Lannea coromandelica, Dizziness,http://doi.org/10.1186/s13002-015-0033-3
Spondias pinnata, Flatulence,http://doi.org/10.1016/j.heliyon.2019.e02768
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-019-0287-2
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-018-0248-1
Spondias pinnata, Flatulence,http://doi.org/10.3897/phytokeys.102.24380
Spondias pinnata, Flatulence,http://doi.org/10.1155/2018/5382904
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-016-0089-8
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-015-0033-3
Spondias pinnata, Flatulence,http://doi.org/10.1186/1472-6882-13-243
Spondias pinnata, Flatulence,http://doi.org/10.1186/1472-6882-10-77
Holarrhena pubescens, Diarrhoea,http://doi.org/10.5455/javar.2019.f379
Holarrhena pubescens, Diarrhoea,http://doi.org/10.1155/2019/2321961
Holarrhena pubescens, Diarrhoea,http://doi.org/10.1186/s12906-018-2348-9
Traceback (most recent call last):
  File "scraperscript_python.py", line 33, in <module>
    handle2 = Entrez.esummary(db="pmc", id=pubmed_id)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\site-packages\Bio\Entrez\__init__.py", line 334, in esummary
    return _open(cgi, variables)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\site-packages\Bio\Entrez\__init__.py", line 569, in _open
    handle = _urlopen(cgi)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 543, in _open
    '_open', req)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1362, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1319, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 966, in send
    self.connect()
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1422, in connect
    server_hostname=server_hostname)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 423, in wrap_socket
    session=session
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 870, in _create
    self.do_handshake()
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
KeyboardInterrupt

I interrupted the rest of the output myself, because I don't want the script to run over the whole data set while it prints in the wrong form. As the Spondias pinnata / Flatulence example shows, each DOI URL is printed on its own line. I don't want that, because it would be extremely difficult to merge the results back into the original data. This CSV file has only 65 entries, but other datasets have more than 8000, which would make it a very difficult job. The output I want to achieve looks like this for the aforementioned plant-disease pair:

Spondias pinnata, Flatulence, http://doi.org/10.1016/j.heliyon.2019.e02768, http://doi.org/10.1186/s13002-019-0287-2, http://doi.org/10.1186/s13002-018-0248-1, http://doi.org/10.3897/phytokeys.102.24380, http://doi.org/10.1155/2018/5382904, http://doi.org/10.1186/s13002-016-0089-8, http://doi.org/10.1186/s13002-015-0033-3, http://doi.org/10.1186/1472-6882-13-243, http://doi.org/10.1186/1472-6882-10-77

Someone from my family suggested that I use a nested dictionary, but I don't see how/if that would help, and I have no idea where to place it in the code, and what changes to make to the already heavily nested loops. Any help with this would be greatly appreciated. Thank you.

Upvotes: 0

Views: 157

Answers (1)

BioGeek

Reputation: 22887

The following code:

from Bio import Entrez
import csv
Entrez.email = "your.name@example.com"  # placeholder; replace with your own address


botanical_names = ['Asystasia salicifalia', 'Asystasia salicifalia', 'Asystasia salicifalia', 'Barleria strigosa', 'Justicia procumbens', 'Justicia procumbens', 'Strobilanthes auriculata', 'Thunbergia laurifolia', 'Thunbergia similis', 'Lannea coromandelica', 'Spondias pinnata']
diseases = ['Puerperal illness', 'Puerperium', 'Puerperal disorder', 'Tonic', 'Lumbago', 'Itching', 'Malnutrition', 'Detoxificant', 'Tonic', 'Dizziness', 'Flatulence']

assert len(botanical_names) == len(diseases)

plant_disease_list = zip(botanical_names, diseases)


with open('plant_diseases.csv', 'w', newline='') as csvfile:
    fieldnames = ['plant', 'disease', 'dois']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

    for plant, disease in plant_disease_list:
        result = {'plant': plant, 'disease': disease}
        search_query = '"{}" AND "{}"'.format(plant, disease)
        # Search PMC for articles that mention both terms (at most 10 hits).
        handle1 = Entrez.esearch(db="pmc", term=search_query, retmax="10")
        record1 = Entrez.read(handle1)
        pubmed_ids = record1.get("IdList")
        if pubmed_ids:
            # Fetch the summaries of all hits for this pair in a single request.
            handle2 = Entrez.esummary(db="pmc", id=','.join(pubmed_ids))
            records = Entrez.read(handle2)
            # Keep only the records that actually carry a DOI.
            dois = [record.get("DOI") for record in records if record.get("DOI") is not None]
            prefix = "http://doi.org/"
            result['dois'] = ','.join(prefix + doi for doi in dois)
        # Write exactly one row per plant-disease pair, with or without DOIs.
        writer.writerow(result)

writes the following output to the file plant_diseases.csv:

plant,disease,dois
Asystasia salicifalia,Puerperal illness,
Asystasia salicifalia,Puerperium,
Asystasia salicifalia,Puerperal disorder,
Barleria strigosa,Tonic,
Justicia procumbens,Lumbago,
Justicia procumbens,Itching,http://doi.org/10.1673/031.012.0501
Strobilanthes auriculata,Malnutrition,
Thunbergia laurifolia,Detoxificant,
Thunbergia similis,Tonic,
Lannea coromandelica,Dizziness,"http://doi.org/10.3897/phytokeys.102.24380,http://doi.org/10.1186/s13002-016-0089-8,http://doi.org/10.1186/s13002-015-0033-3"
Spondias pinnata,Flatulence,"http://doi.org/10.1016/j.heliyon.2019.e02768,http://doi.org/10.1186/s13002-019-0287-2,http://doi.org/10.1186/s13002-018-0248-1,http://doi.org/10.3897/phytokeys.102.24380,http://doi.org/10.1155/2018/5382904,http://doi.org/10.1186/s13002-016-0089-8,http://doi.org/10.1186/s13002-015-0033-3,http://doi.org/10.1186/1472-6882-13-243,http://doi.org/10.1186/1472-6882-10-77"

Note that I have used the csv module to create valid CSV files. This includes adding double quotes around your comma-separated list of DOIs, to distinguish those commas from the ones that delimit the plant and the disease. Also, there is no need to add a None placeholder if you have no DOIs: because the DictWriter is given three field names, it writes three fields for every row and leaves the dois field empty when there is nothing to put in it.
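To see the quoting in action without touching the network, here is a minimal sketch that writes two rows to an in-memory buffer and reads them back. The sample rows and the use of io.StringIO are my own choices for illustration; they are not part of the script above.

```python
import csv
import io

# Illustrative row; the DOIs are taken from the output shown above.
row = {
    "plant": "Lannea coromandelica",
    "disease": "Dizziness",
    "dois": "http://doi.org/10.3897/phytokeys.102.24380,"
            "http://doi.org/10.1186/s13002-016-0089-8",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["plant", "disease", "dois"])
writer.writeheader()
writer.writerow(row)
# A row without a 'dois' key gets an empty third field (restval defaults to '').
writer.writerow({"plant": "Thunbergia similis", "disease": "Tonic"})

csv_text = buffer.getvalue()
print(csv_text)

# Reading it back: the quoted field comes out as one string again,
# which can then be split into the individual URLs.
rows = list(csv.DictReader(io.StringIO(csv_text)))
first_dois = rows[0]["dois"].split(",")
print(first_dois)
```

Any spreadsheet program or CSV reader that honours the quoting will likewise treat the whole DOI list as a single cell.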

Also, don't use string as a variable name, because it is the name of a Python module in the standard library.
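If you would rather print the comma-and-space format from the question than write a CSV file, the only change your original loop needs is to collect the DOIs for a pair first and print once afterwards, instead of printing inside the innermost loop. A minimal sketch, where format_line is a hypothetical helper of mine and the hard-coded DOIs stand in for whatever Entrez returns:

```python
DOI_PREFIX = "http://doi.org/"


def format_line(plant, disease, dois):
    """Return one output line for a plant-disease pair.

    `dois` is a list of bare DOIs (possibly empty). This is a hypothetical
    helper for illustration, not part of Biopython.
    """
    if not dois:
        return "{}, {}, None".format(plant, disease)
    urls = [DOI_PREFIX + doi for doi in dois]
    return ", ".join([plant, disease] + urls)


# Stand-in for the DOIs gathered from Entrez.esummary for one pair.
sample_dois = ["10.1186/1472-6882-13-243", "10.1186/1472-6882-10-77"]
print(format_line("Spondias pinnata", "Flatulence", sample_dois))
print(format_line("Thunbergia similis", "Tonic", []))
```

In your loop you would append each DOI to a list while iterating over the records of one pair, and call print(format_line(plant, disease, dois)) once per pair, after the inner loops have finished. No nested dictionary is needed for that.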

Upvotes: 1
