Reputation: 13
I am trying to extract the data from this page: "https://www.ncbi.nlm.nih.gov/nucleotide/209750423?report=genbank#". When I fetch the content with urllib, I get the same HTML that the browser shows under 'View page source'. What I actually want is the sequence 'atggctgaga tgaaaaacct gaaaattgag gtggtgcgct ataacccgga....', which is visible when I right-click and choose 'Inspect element' but does not appear in 'View page source'.
The code I am using is:
import urllib
f = open('out.html', 'w')
response = urllib.urlopen("https://www.ncbi.nlm.nih.gov/nucleotide/209750423?report=genbank")
f.write(response.read())
f.close()
Upvotes: 0
Views: 756
Reputation: 7401
The data is loaded by JavaScript after the page renders, so request it directly from the endpoint the page calls:
import requests
from pyquery import PyQuery
r = requests.get("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=209750423&db=nuccore&dopt=genbank&extrafeat=976&fmt_mask=0&retmode=html&withmarkup=on&log$=seqview&maxplex=3&maxdownloadsize=1000000")
pq = PyQuery(r.content)
# Each sequence line is rendered in an element with class "ff_line"
div = pq(".ff_line")
data = []
for d in div:
    data.append(d.text)
print(data)
Upvotes: 1
Reputation: 13542
You should take the time to actually look at the page you want to scrape. It is just a page that loads a JS application, and that application then loads the actual data from another URL:
https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?val=209750423&db=nuccore&dopt=genbank&retmode=text
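To get from that plain-text response to just the sequence, something like the following might work. It is only a sketch, written for Python 3, and it assumes the response is a standard GenBank flat file in which the sequence sits between an ORIGIN line and a closing // line. The names extract_sequence and GENBANK_URL are made up for this example.

```python
import re
from urllib.request import urlopen

# URL from the answer above; retmode=text asks for the plain GenBank record.
GENBANK_URL = ("https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi"
               "?val=209750423&db=nuccore&dopt=genbank&retmode=text")


def extract_sequence(genbank_text):
    """Return the sequence found in the ORIGIN section of a GenBank record.

    Assumes the usual flat-file layout: an ORIGIN line, then numbered
    sequence lines, then a line starting with //.
    """
    in_origin = False
    chunks = []
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # Drop the leading position number and the spaces between blocks.
            chunks.append(re.sub(r"[\d\s]", "", line))
    return "".join(chunks)


if __name__ == "__main__":
    text = urlopen(GENBANK_URL).read().decode("utf-8")
    print(extract_sequence(text))
```

The parsing is kept separate from the download so it can be tried on a saved copy of the record without hitting the server again.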
By the way, be sure to check copyright issues before scraping online content.
Upvotes: 0