Carlee B

Reputation: 11

I'm having difficulty using Beautiful Soup to scrape data from an NCBI website

I can't for the life of me figure out how to use Beautiful Soup to scrape the isolation source information from web pages such as this one: https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/

I keep checking whether that tag exists, and it keeps returning that it doesn't, even though I know for a fact it does. If I can't even verify that it exists, I'm not sure how to scrape it.

Thanks!

Upvotes: 1

Views: 512

Answers (2)

Pierre

Reputation: 35226

You shouldn't scrape NCBI when the NCBI E-utilities web service is available.

wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JOKX00000000.2&rettype=gb&retmode=xml" | xmllint --xpath '//GBQualifier[GBQualifier_name="isolation_source"]/GBQualifier_value/text()' - && echo

Type II sourdough
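
For anyone who wants to stay in Python, here is a minimal sketch of the same efetch request using only the standard library. The helper names `fetch_genbank_xml` and `extract_qualifier` are made up for this example; the URL, the query parameters, and the GBQualifier element names are taken from the command above.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def extract_qualifier(xml_text, name):
    """Return the values of every GBQualifier whose name equals `name`."""
    root = ET.fromstring(xml_text)
    values = []
    for qualifier in root.iter("GBQualifier"):
        if qualifier.findtext("GBQualifier_name") == name:
            values.append(qualifier.findtext("GBQualifier_value"))
    return values


def fetch_genbank_xml(accession):
    """Download the GenBank XML for one nuccore accession (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "xml"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}", timeout=30) as response:
        return response.read()


# Usage (needs network access):
# xml_text = fetch_genbank_xml("JOKX00000000.2")
# print(extract_qualifier(xml_text, "isolation_source"))
```

`extract_qualifier` returns a list because a record may carry the same qualifier on more than one feature; an empty list means the qualifier is absent.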

Upvotes: 1

Andrej Kesely

Reputation: 195408

The data is loaded from an external URL, so it is not in the HTML that Beautiful Soup receives for the record page. To get isolation_source, you can use this example:

import re
import requests
from bs4 import BeautifulSoup

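url = "https://www.ncbi.nlm.nih.gov/nuccore/JOKX00000000.2/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# The static page exposes the record's UID in the element named "ncbi_uidlist".
ncbi_uidlist = soup.select_one('[name="ncbi_uidlist"]')["content"]

# External URL that serves the GenBank record itself.
api_url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"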

params = {
    "id": ncbi_uidlist,
    "db": "nuccore",
    "report": "genbank",
    "extrafeat": "null",
    "conwithfeat": "on",
    "hide-cdd": "on",
    "retmode": "html",
    "withmarkup": "on",
    "tool": "portal",
    "log$": "seqview",
    "maxdownloadsize": "1000000",
}

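# Fetch the rendered record and take the text of the first ".feature" element.
soup = BeautifulSoup(
    requests.get(api_url, params=params).content, "html.parser"
)
features = soup.select_one(".feature").text

# Extract the quoted value of the /isolation_source qualifier.
isolation_source = re.search(r'isolation_source="([^"]+)"', features).group(1)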
print(features)
print("-" * 80)
print(isolation_source)

Prints:

     source          1..12
                     /organism="Limosilactobacillus reuteri"
                     /mol_type="genomic DNA"
                     /strain="TMW1.112"
                     /isolation_source="Type II sourdough"
                     /db_xref="taxon:1598"
                     /country="Germany"
                     /collection_date="1998"

--------------------------------------------------------------------------------
Type II sourdough
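
If more than one qualifier is needed, the same regex idea can be generalized. The following is only a sketch: `parse_qualifiers` is a hypothetical helper, and it assumes the feature text has the `/name="value"` layout shown in the output above. Unquoted qualifiers, if a record has any, are ignored.

```python
import re


def parse_qualifiers(features_text):
    """Collect /name="value" pairs from GenBank feature text into a dict.

    Whitespace inside a value is collapsed so that values wrapped across
    lines come back as a single line. If a name occurs more than once,
    the first occurrence is kept.
    """
    qualifiers = {}
    for name, value in re.findall(r'/(\w+)="([^"]*)"', features_text):
        qualifiers.setdefault(name, " ".join(value.split()))
    return qualifiers


# Usage with the `features` string from the example above:
# qualifiers = parse_qualifiers(features)
# print(qualifiers.get("isolation_source"))
```

Using `.get()` returns None when the qualifier is missing, instead of raising the AttributeError that `re.search(...).group(1)` raises when nothing matches.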

Upvotes: 0
