Reputation: 4103
I am trying to scrape https://www.spdrs.com/product/fund.seam?ticker=SPY using urllib2 and BeautifulSoup. However, I found that the HTML I get from urllib2 is incomplete. Everything inside the <span> node
shown below is missing from the string read by urllib2.
<span xmlns="http://www.w3.org/1999/xhtml" id="performancePanel">
bunch of divs in here.
</span>
Why is this the case? I suspect it has something to do with the xmlns, because I have never seen anyone put this attribute on a span.
Upvotes: 0
Views: 292
Reputation: 6520
If you use "view source" in your browser, you will see the same HTML that urllib2 receives.
In that source, the node looks like this:
<span id="performancePanel"></span>
Notice that there are no divs in that span. The divs are added by JavaScript after the page loads. Look at the bottom of the source and you will see some JS code and the comment
<!-- load performance and holdings content by ajax -->
I think that is where it gets loaded.
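To confirm this from a script rather than by eye, you can check whether the span has any content in the HTML you downloaded. This is only a minimal sketch in Python 3 using the standard library's html.parser, so it does not need BeautifulSoup. The names ContentProbe and is_populated are made up for illustration, and the two HTML strings are stand-ins for the raw source and the rendered DOM, not the real page.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContentProbe(HTMLParser):
    """Counts child tags and text inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # 0 means we are outside the target element
        self.child_tags = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.child_tags += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.text_chars += len(data.strip())


def is_populated(html, element_id):
    """Return True if the element with this id contains tags or visible text."""
    probe = ContentProbe(element_id)
    probe.feed(html)
    probe.close()
    return probe.child_tags > 0 or probe.text_chars > 0


# Stand-in for what "view source" (and urllib2) returns.
STATIC_HTML = ('<html><body><span id="performancePanel"></span>'
               '<!-- load performance and holdings content by ajax -->'
               '</body></html>')

# Stand-in for what the browser shows after the scripts have run.
RENDERED_HTML = ('<span xmlns="http://www.w3.org/1999/xhtml" id="performancePanel">'
                 '<div>bunch of divs in here.</div></span>')

print(is_populated(STATIC_HTML, "performancePanel"))    # False
print(is_populated(RENDERED_HTML, "performancePanel"))  # True
```

On the real page you would pass in the string you downloaded. In Python 3 the urllib2 functions live in urllib.request.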
Since the data is loaded by JavaScript, urllib2 alone will never see it: it only downloads the initial HTML and does not run scripts. To scrape it that way you would have to reverse engineer the JavaScript, figure out which URLs it requests, and then scrape those URLs directly.
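A first step for the reverse-engineering route could be to list the URL-like string literals in the page's inline scripts and see which one looks like the AJAX call. The sketch below is Python 3, standard library only. candidate_endpoints is a made-up helper, and the loadPanel call and the /hypothetical/... path in the sample are invented for illustration; I do not know what the real page requests. It only scans inline script bodies, not external .js files.

```python
import re

# Bodies of inline <script> blocks.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.IGNORECASE | re.DOTALL)

# Quoted string literals that start like a URL or an absolute path.
URL_LITERAL_RE = re.compile(r"""["']((?:https?://|/)[^"'\s]+)["']""")


def candidate_endpoints(html):
    """Return URL-like string literals found in inline scripts, in order, without duplicates."""
    urls = []
    for body in SCRIPT_RE.findall(html):
        for url in URL_LITERAL_RE.findall(body):
            if url not in urls:
                urls.append(url)
    return urls


# Hypothetical page fragment: the script call and its path are invented.
SAMPLE_HTML = """
<span id="performancePanel"></span>
<!-- load performance and holdings content by ajax -->
<script type="text/javascript">
  loadPanel('performancePanel', '/hypothetical/performance.seam?ticker=SPY');
</script>
"""

print(candidate_endpoints(SAMPLE_HTML))
# ['/hypothetical/performance.seam?ticker=SPY']
```

Once you have a candidate, request it directly and check whether the response contains the divs you are after. The network tab of your browser's developer tools shows the same requests and is often quicker than reading the script.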
If that is too difficult, you might want to investigate using Selenium, which drives a real browser, so the JavaScript runs and you can scrape the populated page.
Upvotes: 1