Reputation: 11
I can't for the life of me figure out how to use beautiful soup to scrape the isolation source information from web pages such as this: https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/
I keep trying to check if that tag exists and it keep returning that it doesn't, when I know for a fact it does. If I can't even verify it exists I'm not sure how to scrape it.
Thanks!
Upvotes: 1
Views: 512
Reputation: 35226
you shouldn' scrape the ncbi when there is the NCBI-EUtilities web service.
wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JOKX00000000.2&rettype=gb&retmode=xml" | xmllint --xpath '//GBQualifier[GBQualifier_name="isolation_source"]/GBQualifier_value/text()' - && echo
Type II sourdough
Upvotes: 1
Reputation: 195408
The data is loaded from external URL. To get isolation_source
, you can use this example:
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
ncbi_uidlist = soup.select_one('[name="ncbi_uidlist"]')["content"]
api_url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
params = {
"id": ncbi_uidlist,
"db": "nuccore",
"report": "genbank",
"extrafeat": "null",
"conwithfeat": "on",
"hide-cdd": "on",
"retmode": "html",
"withmarkup": "on",
"tool": "portal",
"log$": "seqview",
"maxdownloadsize": "1000000",
}
soup = BeautifulSoup(
requests.get(api_url, params=params).content, "html.parser"
)
features = soup.select_one(".feature").text
isolation_source = re.search(r'isolation_source="([^"]+)"', features).group(1)
print(features)
print("-" * 80)
print(isolation_source)
Prints:
source 1..12
/organism="Limosilactobacillus reuteri"
/mol_type="genomic DNA"
/strain="TMW1.112"
/isolation_source="Type II sourdough"
/db_xref="taxon:1598"
/country="Germany"
/collection_date="1998"
--------------------------------------------------------------------------------
Type II sourdough
Upvotes: 0