Bioinformatics : Programmatic Access to the BacDive Database

Question

the resource at "BacDive" - ( http://bacdive.dsmz.de/) is a highly useful database for accessing bacterial knowledge, such as strain information, species information and parameters such as growth temperature optimums.

I have a scenario in which I have a set of organism names in a plain text file, and I would like to programmatically search them 1 by 1 against the Bacdive database (which doesnt allow a flat file to be downloaded) and retrieve the relevent information and populate my text file accordingly.

What are the main modules (such as beautifulsoups) that I would need to accomplish this? Is it straight forward? Is it allowed to programmatically access webpages ? Do I need permission?

A bacteria name would be "Pseudomonas putida" . Searching this would give 60 hits on bacdive. Clicking one of the hits, takes us to the specific page, where the line : "Growth temperature: [Ref.: #27] Recommended growth temperature : 26 °C " is the most important.

The script would have to access bacdive (which i have tried accessing using requests, but I feel they do not allow programmatic access, I have asked the moderator about this, and they said I should register for their API first).

I now have the API access. This is the page (http://www.bacdive.dsmz.de/api/bacdive/). This may seem quite simple to people who do HTML scraping, but I am not sure what to do now that I have access to the API.

Chandu · Accepted Answer

Here is the solution...

import re
import urllib
from bs4 import BeautifulSoup


def get_growth_temp(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    no_hits = int(map(float, re.findall(r'[+-]?[0-9]+',str(soup.find_all("span", class_="searchresultlayerhits"))))[0])
    if no_hits > 1 :
        letters = soup.find_all("li", class_="searchresultrow1") + soup.find_all("li", class_="searchresultrow2")
    all_urls = []
    for i in letters:
        all_urls.append('http://bacdive.dsmz.de/index.php' + i.a["href"])
    max_temp = []
    for ind_url in all_urls:
        soup = BeautifulSoup(urllib.urlopen(ind_url).read())
        a = soup.body.findAll(text=re.compile('Recommended growth temperature :'))
       if a:
           max_temp.append(int(map(float, re.findall(r'[+-]?[0-9]+', str(a)))[0]))
    print "Recommended growth temperature : %d °C:	" % max(max_temp)

url = 'http://bacdive.dsmz.de/index.php?search=Pseudomonas+putida'

if __name__ == "__main__":
        # TO Open file then iterate thru the urls/bacterias 
        # with open('file.txt', 'rU') as f:
        #   for url in f:
        #      get_growth_temp(url)
     get_growth_temp(url)

Edit:

Here I am passing single url. if you want to pass multiple urls to get their growth temperature. call the function(url) by opening file. code is commented.

Hope it helped you.. Thanks

Bioinformatics : Programmatic Access to the BacDive Database

Answers (1)

Edit:

Related Questions