Reputation: 75
BacDive (http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information, and parameters such as optimum growth temperature.
I have a set of organism names in a plain text file, and I would like to programmatically search them one by one against the BacDive database (which doesn't allow a flat file to be downloaded), retrieve the relevant information, and populate my text file accordingly.
What are the main modules (such as BeautifulSoup) that I would need to accomplish this? Is it straightforward? Is it allowed to programmatically access webpages? Do I need permission?
An example bacterium name would be "Pseudomonas putida". Searching for this gives 60 hits on BacDive. Clicking one of the hits takes us to that strain's page, where the line "Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C" is the most important.
The script would have to access BacDive. I tried accessing it using requests, but I suspect they do not allow programmatic access. I asked the moderator about this, and they said I should register for their API first.
I now have API access. This is the page: http://www.bacdive.dsmz.de/api/bacdive/. This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.
Upvotes: 0
Views: 291
Reputation: 2129
Here is a solution that scrapes the search results pages:
    import re
    from urllib.request import urlopen

    from bs4 import BeautifulSoup


    def get_growth_temp(url):
        soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
        # Number of hits reported on the search results page.
        hits = soup.find_all("span", class_="searchresultlayerhits")
        no_hits = int(re.findall(r'[+-]?[0-9]+', str(hits))[0])
        if no_hits > 1:
            # Result rows alternate between two CSS classes.
            rows = (soup.find_all("li", class_="searchresultrow1")
                    + soup.find_all("li", class_="searchresultrow2"))
            all_urls = ['http://bacdive.dsmz.de/index.php' + row.a["href"] for row in rows]
            max_temp = []
            for ind_url in all_urls:
                # Fetch each strain page and look for the temperature line.
                soup = BeautifulSoup(urlopen(ind_url).read(), 'html.parser')
                a = soup.body.find_all(string=re.compile('Recommended growth temperature :'))
                if a:
                    max_temp.append(int(re.findall(r'[+-]?[0-9]+', str(a))[0]))
            # Report the highest temperature found; skip if no page had one.
            if max_temp:
                print("Recommended growth temperature : %d °C" % max(max_temp))


    url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'

    if __name__ == "__main__":
        # To process several searches, open a file with one URL per line:
        # with open('file.txt') as f:
        #     for url in f:
        #         get_growth_temp(url.strip())
        get_growth_temp(url)
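The regular expression in the code above takes the first integer in the matched text. If the reference number (the "#27" in the line quoted in the question) ends up in the same text as the temperature, that could return the wrong number. Whether this happens on the real page is an assumption I have not checked. A minimal sketch that anchors on the label instead, with parse_growth_temp as a hypothetical helper name:

```python
import re

# Anchor on the label so a reference number such as "#27" is not picked up.
TEMP_PATTERN = re.compile(
    r'Recommended growth temperature\s*:\s*([+-]?\d+(?:\.\d+)?)'
)


def parse_growth_temp(text):
    """Return the recommended growth temperature in °C as a float, or None.

    Hypothetical helper: it only relies on the label text quoted in the question.
    """
    match = TEMP_PATTERN.search(text)
    return float(match.group(1)) if match else None


# The example line quoted in the question.
line = 'Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C '
print(parse_growth_temp(line))  # 26.0
```

It returns None when the label is missing, so pages without a recommended temperature can be skipped rather than raising an error.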
Here I am passing a single URL. If you want the growth temperature for multiple URLs, open the file and call get_growth_temp(url) for each line; that code is included but commented out.
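Since the question starts from a file of organism names rather than URLs, the names would first need to be turned into search URLs. A rough sketch, assuming the search URL pattern from the example above still applies; build_search_url and read_search_urls are hypothetical helper names, not part of BacDive or any library:

```python
from urllib.parse import quote_plus

# Search URL pattern taken from the example URL in the answer.
SEARCH_URL = 'http://bacdive.dsmz.de/index.php?search='


def build_search_url(name):
    """Turn an organism name into a search URL (hypothetical helper)."""
    # quote_plus encodes spaces as '+', matching 'Pseudomonas+putida'.
    return SEARCH_URL + quote_plus(name.strip())


def read_search_urls(path):
    """Yield one search URL per non-empty line of the names file."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield build_search_url(line)


print(build_search_url('Pseudomonas putida'))
# http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida
```

Each URL from read_search_urls('file.txt') could then be passed to get_growth_temp. Adding a short pause between requests (for example with time.sleep) would be polite to the server.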
Hope this helps. Thanks.
Upvotes: 2