Ajay Nair

Reputation: 1867

Reading dynamically generated web pages using python

I am trying to scrape a web site using Python and Beautiful Soup. On some sites, the image links are visible in the browser but cannot be seen in the page source. However, with Chrome Inspect or Fiddler, I can see the corresponding code. What I see in the source code is:

<div id="cntnt"></div>

But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python, and I am able to get the source but without the generated part.

I am not a web developer, so I am not able to describe the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!

Upvotes: 24

Views: 42793

Answers (4)

TheHeadlessSourceMan

Reputation: 737

TRY THIS FIRST!

The data could technically be embedded in the JavaScript itself, in which case all this JavaScript engine business is needed. (Some GREAT links here!)

But from experience, my first guess is that the JS is pulling the data in via an Ajax request. If you can get your program to simulate that request, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!

It will take a little detective work, though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.

Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
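As a rough illustration of re-creating such a request, here is a minimal sketch using only the standard library. The function names, the User-Agent string, and the assumption that the endpoint returns UTF-8 JSON are all hypothetical; the real URL and any extra headers have to come from what you observe in the network log.

```python
import json
import urllib.request

# Hypothetical browser-like User-Agent; copy the one your own browser sends.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_ajax_request(url, user_agent=BROWSER_UA):
    """Build a GET request that presents itself as a regular browser."""
    headers = {
        "User-Agent": user_agent,
        "Accept": "application/json",
    }
    return urllib.request.Request(url, headers=headers)


def decode_json_body(body, encoding="utf-8"):
    """Decode a raw response body, assuming it holds JSON text."""
    return json.loads(body.decode(encoding))


def fetch_json(url, timeout=10):
    """Send the request and return the parsed JSON payload."""
    request = build_ajax_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return decode_json_body(response.read())
```

Calling fetch_json with the URL seen in the network log should then return the parsed data as Python dicts and lists, provided the server does not also require cookies, a POST body, or other headers that the browser sent.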

Upvotes: 0

ivan_pozdeev

Reputation: 35998

A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser instead, one that builds the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.

Keep in mind when choosing that some of the formerly major products are now abandoned.
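For illustration only, here is a sketch of the headless-browser approach using Selenium with Chrome. Selenium is my assumption, not something named above; it is a third-party package that must be installed along with a Chrome browser. The "cntnt" id comes from the question, while the function names and the wait condition are hypothetical choices.

```python
from html.parser import HTMLParser


def render_page(url, container_id="cntnt", timeout=10):
    """Load url in headless Chrome and return the HTML after scripts have run.

    Assumes the third-party selenium package and Chrome are installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the container has at least one generated child element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "#%s *" % container_id)
            )
        )
        return driver.page_source
    finally:
        driver.quit()


class ImageLinkParser(HTMLParser):
    """Collect the src attribute of every <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.links.append(src)


def extract_image_links(html):
    """Return the image URLs found in an HTML string."""
    parser = ImageLinkParser()
    parser.feed(html)
    return parser.links
```

Something like extract_image_links(render_page(url)) would then list the image links that the plain page source lacks. The rendered HTML can equally be handed to Beautiful Soup in place of the small standard-library parser used here.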

Upvotes: 0

Andrey Nikishaev

Reputation: 3882

You need a JavaScript engine to parse and run the JavaScript code inside the page. There are a bunch of headless browsers that can help you:

http://code.google.com/p/spynner/

http://phantomjs.org/

http://zombie.labnotes.org/

http://github.com/ryanpetrello/python-zombie

http://jeanphix.me/Ghost.py/

http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

Upvotes: 17

ppsreejith

Reputation: 3438

The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer

Upvotes: 6
