bugcracker

Reputation: 13

How to extract content from a webpage as seen in the browser using Python

I am trying to extract the data on this page: "https://www.ncbi.nlm.nih.gov/nucleotide/209750423?report=genbank#". When I use urllib, I only get the content that the browser shows under 'view page source'. What I want is the actual sequence 'atggctgaga tgaaaaacct gaaaattgag gtggtgcgct ataacccgga....', which is visible when I right-click in the browser and select 'inspect element', but not in 'view page source'.

The code I am using is:

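import urllib.request

# Fetch the page as served (the same content as 'view page source') and save it
response = urllib.request.urlopen("https://www.ncbi.nlm.nih.gov/nucleotide/209750423?report=genbank")
with open('out.html', 'wb') as f:
    f.write(response.read())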

Upvotes: 0

Views: 756

Answers (2)

Mesut GUNES

Reputation: 7401

The data is loaded by JavaScript after the page itself has loaded, so request the URL that the script fetches it from:

import requests
from pyquery import PyQuery

r = requests.get("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=209750423&db=nuccore&dopt=genbank&extrafeat=976&fmt_mask=0&retmode=html&withmarkup=on&log$=seqview&maxplex=3&maxdownloadsize=1000000")
pq = PyQuery(r.content)
div = pq(".ff_line")

# Collect the text of each element with class "ff_line"
data = []
for d in div:
    data.append(d.text)

print(data)

Upvotes: 1

spectras

Reputation: 13542

You should take the time to actually look at the page you want to scrape. It only loads a JavaScript application, and that application then loads the actual data from another URL. You can request that URL directly:

https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=209750423&db=nuccore&dopt=genbank&retmode=text
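
As a rough sketch of how that plain-text URL could be used from Python 3: the code below downloads the record and pulls the sequence out of it. It assumes the response is a standard GenBank flat file, where the sequence sits between an ORIGIN line and a closing // line; the endpoint's current output has not been verified here. The names extract_sequence and fetch_sequence are made up for this illustration.

```python
import re
import urllib.request

# URL taken from the answer above; retmode=text asks for the plain-text record.
URL = ("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
       "?val=209750423&db=nuccore&dopt=genbank&retmode=text")


def extract_sequence(genbank_text):
    """Return the sequence from the ORIGIN section of a GenBank flat file.

    Assumes the usual layout: an 'ORIGIN' line, then numbered sequence
    lines, then a line starting with '//'.
    """
    in_origin = False
    parts = []
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if in_origin and line.startswith("//"):
            break
        if in_origin:
            # Lines look like "        1 atggctgaga tgaaaaacct ...";
            # keep only the letters, dropping position numbers and spaces.
            parts.append(re.sub(r"[^a-zA-Z]", "", line))
    return "".join(parts)


def fetch_sequence(url=URL):
    """Hypothetical helper: download the record and extract its sequence."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return extract_sequence(text)
```

Calling fetch_sequence() performs the network request. extract_sequence() can be tried offline on any saved GenBank record.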

By the way, be sure to check copyright issues before scraping online content.

Upvotes: 0
