Reputation: 4449
I'm not sure why I can't get the page from this link. All I want to do is fetch it and feed it into BeautifulSoup.
import requests,urllib2
link='https://www.sec.gov/ix?doc=/Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm'
r = requests.get(link)
r2=urllib2.urlopen(link)
html=r2.read()
I also tried faking a browser with:
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(link, headers=headers)
The response text is the same, and it is not the page I want.
I'm getting a header that looks like this:
var note = 'The browser mode you are running is not compatible with this application.';
browserName ='Microsoft Internet Explorer';
note +='You are currently running '+browserName+' '+((ie7>0)?7:8)+'.0.';
var userAgent = window.navigator.userAgent.toLowerCase();
if(userAgent.indexOf('ipad') != -1 || userAgent.indexOf('iphone') != -1 || userAgent.indexOf('apple') != -1){
note += ' Please use a more current version of '+browserName+' in order to use the application.';
}else if(userAgent.indexOf('android') != -1){
note += ' Please use a more current version of Google Chrome or Mozilla Firefox in order to use the application.';
}else{
note += ' Please use a more current version of Microsoft Internet Explorer, Google Chrome or Mozilla Firefox in order to use the application.';
}
I can get this page fine:
https://www.sec.gov/Archives/edgar/data/1373715/000137371518000153/erq2fy18-document.htm
which is not an XBRL document. I think it has something to do with the XBRL, and the server wants my browser to interact with the data?
Upvotes: 3
Views: 144
Reputation: 15376
It seems that this part of the page is rendered by JavaScript. Usually the most reliable option for dynamic content is selenium, but in this case you can avoid it and use requests.
The doc parameter in the link shows that the page loads its contents from the document /Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm. You can bypass the viewer page and request the document directly.
import requests
url = "https://www.sec.gov/Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm"
r = requests.get(url)
html = r.text
print(html)
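If you have many viewer links, you don't have to rewrite each one by hand. Here is a minimal sketch that pulls the doc parameter out of the link and resolves it against the same host. It assumes Python 3 and that the viewer always passes a server-relative path in doc; the helper name direct_document_url is just something I made up for illustration.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def direct_document_url(viewer_url):
    """Return the URL of the document that an /ix?doc=... viewer link points to."""
    query = urlsplit(viewer_url).query
    doc_paths = parse_qs(query).get("doc")
    if not doc_paths:
        # No doc parameter, so this is not a viewer link; return it unchanged.
        return viewer_url
    # The doc value is a server-relative path; resolve it against the viewer URL
    # so the scheme and host are reused.
    return urljoin(viewer_url, doc_paths[0])


link = "https://www.sec.gov/ix?doc=/Archives/edgar/data/1373715/000137371518000157/now-2018630x10q.htm"
url = direct_document_url(link)
print(url)
```

The resulting url can then be passed to requests.get(url) as above, and r.text fed into BeautifulSoup.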
Upvotes: 2