Reputation: 3
People of StackOverflow, first of all, thanks for your patience. I understand this is my third thread on the subject, but I'm getting nowhere and don't even know where to start (I don't know what I don't know), so I thought I'd ask here anyway. I'm trying to pull references from PMC using Biopython and write them back into a CSV file that contains, among other things, the plant name, the disease/condition it is used against (or its medicinal action), and the DOI URLs of the papers that mention the given plant-disease pair. After many hours of trying to understand what to do, and discussing the code with people much more experienced than myself, this is what I finally ended up with in Visual Studio Code:
for plant, disease in plant_disease_list:
    search_query = generate_search_query(plant, disease)
    handle1 = Entrez.esearch(db="pmc", term=search_query, retmax="10")
    record1 = Entrez.read(handle1)
    pubmed_ids = record1.get("IdList")
    if len(pubmed_ids)==0:
        print("{}, {}, None".format(plant, disease))
    else:
        for pubmed_id in pubmed_ids:
            handle2 = Entrez.esummary(db="pmc", id=pubmed_id)
            records = Entrez.read(handle2)
            for record in records:
                doi = record.get("DOI")
                if doi is None:
                    print(("{}, {}".format(plant, disease)))
                else:
                    doi_main = doi.split()
                    string = "http://doi.org/"
                    to_add = (",").join((string + x) for x in doi_main)
                    print("{}, {},".format(plant, disease), to_add, sep="")
where generate_search_query was previously defined as:
def generate_search_query(plant, disease):
    search_query = '"{}" AND "{}"'.format(plant, disease)
    return search_query
This is the output I'm getting:
Asystasia salicifalia, Puerperal illness, None
Asystasia salicifalia, Puerperium, None
Asystasia salicifalia, Puerperal disorder, None
Barleria strigosa, Tonic
Justicia procumbens, Lumbago, None
Justicia procumbens, Itching,http://doi.org/10.1673/031.012.0501
Strobilanthes auriculata, Malnutrition, None
Thunbergia laurifolia, Detoxificant, None
Thunbergia similis, Tonic, None
Lannea coromandelica, Dizziness,http://doi.org/10.3897/phytokeys.102.24380
Lannea coromandelica, Dizziness,http://doi.org/10.1186/s13002-016-0089-8
Lannea coromandelica, Dizziness,http://doi.org/10.1186/s13002-015-0033-3
Spondias pinnata, Flatulence,http://doi.org/10.1016/j.heliyon.2019.e02768
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-019-0287-2
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-018-0248-1
Spondias pinnata, Flatulence,http://doi.org/10.3897/phytokeys.102.24380
Spondias pinnata, Flatulence,http://doi.org/10.1155/2018/5382904
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-016-0089-8
Spondias pinnata, Flatulence,http://doi.org/10.1186/s13002-015-0033-3
Spondias pinnata, Flatulence,http://doi.org/10.1186/1472-6882-13-243
Spondias pinnata, Flatulence,http://doi.org/10.1186/1472-6882-10-77
Holarrhena pubescens, Diarrhoea,http://doi.org/10.5455/javar.2019.f379
Holarrhena pubescens, Diarrhoea,http://doi.org/10.1155/2019/2321961
Holarrhena pubescens, Diarrhoea,http://doi.org/10.1186/s12906-018-2348-9
Traceback (most recent call last):
  File "scraperscript_python.py", line 33, in <module>
    handle2 = Entrez.esummary(db="pmc", id=pubmed_id)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\site-packages\Bio\Entrez\__init__.py", line 334, in esummary
    return _open(cgi, variables)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\site-packages\Bio\Entrez\__init__.py", line 569, in _open
    handle = _urlopen(cgi)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 543, in _open
    '_open', req)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1362, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1319, in do_open
    encode_chunked=req.has_header('Transfer-encoding'))
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1026, in _send_output
    self.send(msg)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 966, in send
    self.connect()
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\http\client.py", line 1422, in connect
    server_hostname=server_hostname)
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 423, in wrap_socket
    session=session
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 870, in _create
    self.do_handshake()
  File "C:\Users\ASUS\AppData\Local\Programs\Python\Python37\lib\ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
KeyboardInterrupt
I interrupted the rest of the run myself (hence the KeyboardInterrupt), because I don't want it to run on the whole data while it is printing in the wrong form. As the Spondias pinnata / Flatulence example shows, each DOI URL is printed on its own line. I don't want that, because it would be extremely difficult to merge the output back into the original data. This CSV file has only 65 entries, but there are datasets with more than 8000 entries, which would make this a very laborious job. For the aforementioned plant-disease pair, the output I want looks like this:
Spondias pinnata, Flatulence, http://doi.org/10.1016/j.heliyon.2019.e02768, http://doi.org/10.1186/s13002-019-0287-2, http://doi.org/10.1186/s13002-018-0248-1, http://doi.org/10.3897/phytokeys.102.24380, http://doi.org/10.1155/2018/5382904, http://doi.org/10.1186/s13002-016-0089-8, http://doi.org/10.1186/s13002-015-0033-3, http://doi.org/10.1186/1472-6882-13-243, http://doi.org/10.1186/1472-6882-10-77
Someone from my family suggested that I use a nested dictionary, but I don't see how/if that would help, and I have no idea where to place it in the code, and what changes to make to the already heavily nested loops. Any help with this would be greatly appreciated. Thank you.
Reputation: 22887
The following code:
from Bio import Entrez
import csv

# Placeholder address (the original was lost to e-mail obfuscation); use your own.
Entrez.email = "you@example.com"

botanical_names = ['Asystasia salicifalia', 'Asystasia salicifalia', 'Asystasia salicifalia', 'Barleria strigosa', 'Justicia procumbens', 'Justicia procumbens', 'Strobilanthes auriculata', 'Thunbergia laurifolia', 'Thunbergia similis', 'Lannea coromandelica', 'Spondias pinnata']
diseases = ['Puerperal illness', 'Puerperium', 'Puerperal disorder', 'Tonic', 'Lumbago', 'Itching', 'Malnutrition', 'Detoxificant', 'Tonic', 'Dizziness', 'Flatulence']
assert len(botanical_names) == len(diseases)
plant_disease_list = zip(botanical_names, diseases)

with open('plant_diseases.csv', 'w', newline='') as csvfile:
    fieldnames = ['plant', 'disease', 'dois']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for plant, disease in plant_disease_list:
        result = {'plant': plant,
                  'disease': disease}
        search_query = '"{}" AND "{}"'.format(plant, disease)
        handle1 = Entrez.esearch(db="pmc", term=search_query, retmax="10")
        record1 = Entrez.read(handle1)
        pubmed_ids = record1.get("IdList")
        if pubmed_ids:
            # One esummary request for all IDs of this pair, not one request per ID
            handle2 = Entrez.esummary(db="pmc", id=','.join(pubmed_ids))
            records = Entrez.read(handle2)
            dois = [record.get("DOI") for record in records if record.get("DOI") is not None]
            prefix = "http://doi.org/"
            # Collect every DOI URL of the pair into a single field
            dois = ','.join([prefix + doi for doi in dois])
            result['dois'] = dois
        # Exactly one row per plant-disease pair; 'dois' stays empty without hits
        writer.writerow(result)
writes the following output to the file plant_diseases.csv:
plant,disease,dois
Asystasia salicifalia,Puerperal illness,
Asystasia salicifalia,Puerperium,
Asystasia salicifalia,Puerperal disorder,
Barleria strigosa,Tonic,
Justicia procumbens,Lumbago,
Justicia procumbens,Itching,http://doi.org/10.1673/031.012.0501
Strobilanthes auriculata,Malnutrition,
Thunbergia laurifolia,Detoxificant,
Thunbergia similis,Tonic,
Lannea coromandelica,Dizziness,"http://doi.org/10.3897/phytokeys.102.24380,http://doi.org/10.1186/s13002-016-0089-8,http://doi.org/10.1186/s13002-015-0033-3"
Spondias pinnata,Flatulence,"http://doi.org/10.1016/j.heliyon.2019.e02768,http://doi.org/10.1186/s13002-019-0287-2,http://doi.org/10.1186/s13002-018-0248-1,http://doi.org/10.3897/phytokeys.102.24380,http://doi.org/10.1155/2018/5382904,http://doi.org/10.1186/s13002-016-0089-8,http://doi.org/10.1186/s13002-015-0033-3,http://doi.org/10.1186/1472-6882-13-243,http://doi.org/10.1186/1472-6882-10-77"
Note that I have used the csv module to create valid CSV files. This includes adding double quotes around your comma-separated list of DOIs, to separate those commas from the ones that delimit the plant and the disease. Also, there is no need to add a None placeholder if you have no DOIs. Since the DictWriter is given three field names, it writes three fields per row and leaves the dois field empty when a row has no value for it.
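To illustrate the quoting, here is a minimal sketch that writes two rows to an in-memory buffer instead of a file and reads them back. The sample rows and the names buffer, rows and doi_lists are only for illustration; the point is that the quoted dois field survives the round trip as one field, which you can then split on commas yourself.

```python
import csv
import io

# Illustrative sample row, shaped like the output above
row = {
    'plant': 'Lannea coromandelica',
    'disease': 'Dizziness',
    'dois': 'http://doi.org/10.3897/phytokeys.102.24380,'
            'http://doi.org/10.1186/s13002-016-0089-8',
}

# Write to an in-memory buffer rather than a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['plant', 'disease', 'dois'])
writer.writeheader()
writer.writerow(row)
# No 'dois' key here: DictWriter fills the missing field with an empty string
writer.writerow({'plant': 'Thunbergia similis', 'disease': 'Tonic'})

# Read the rows back; the quoted field comes back as a single string
buffer.seek(0)
rows = list(csv.DictReader(buffer))

# Split the DOI field into a list, treating an empty field as "no DOIs"
doi_lists = [r['dois'].split(',') if r['dois'] else [] for r in rows]
print(doi_lists)
```

Any spreadsheet program or CSV parser that honours the quoting should read the file the same way, so the plant and disease columns stay aligned with the original data.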
Also, don't use string as a variable name, because it is the name of a module in the Python standard library, and your variable would shadow that module if you ever import it.
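As for the nested dictionary suggested in the question: a flat dictionary keyed by the (plant, disease) pair is probably enough. The sketch below is not part of the code above; group_dois and the sample triples are hypothetical names and data, and it assumes you have already collected one (plant, disease, doi) triple per hit, with None as the DOI when there is none.

```python
from collections import defaultdict


def group_dois(triples, prefix="http://doi.org/"):
    """Group DOI URLs by (plant, disease); hypothetical helper for illustration."""
    grouped = defaultdict(list)
    for plant, disease, doi in triples:
        # Looking up the key creates an empty list, so pairs without
        # any DOI still end up in the result
        urls = grouped[(plant, disease)]
        if doi is not None:
            urls.append(prefix + doi)
    return dict(grouped)


# Illustrative input: one triple per hit, None where no DOI was found
triples = [
    ("Spondias pinnata", "Flatulence", "10.1186/1472-6882-13-243"),
    ("Spondias pinnata", "Flatulence", "10.1186/1472-6882-10-77"),
    ("Barleria strigosa", "Tonic", None),
]

grouped = group_dois(triples)
for (plant, disease), urls in grouped.items():
    # One output line per pair, however many DOIs it has
    print(", ".join([plant, disease] + urls))
```

Each pair then yields exactly one line, which is the shape asked for in the question. For a real CSV file, writing the grouped values through the csv module is still the safer route.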