user1763180

Reputation: 115

HTML Page Scraping

What's the best way to scrape a web page that has AJAX/dynamic loading of data?

For example: scraping a webpage that presents 20 images on load, but loads more images when the user scrolls down the page (sort of like Facebook). In such a case, how do you scrape all the images, not just the first 20?

Upvotes: 3

Views: 836

Answers (3)

SyntaxError

Reputation: 125

Crawljax is open source and can dynamically crawl Ajax-based content.

Upvotes: 1

Cristian Lupascu

Reputation: 40576

Use a tool such as Fiddler or Wireshark to inspect the web request that is made when more items are loaded.

Then replicate the request in your code.


Update (thanks to pguardiario for his comment):

Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on), and it may be painful to use in a scenario like this, where you only want to see the HTTP requests.

So you're better off using Fiddler, or a similar tool built into the browser (e.g. the Network panel in Chrome's developer tools).
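As a rough illustration of "replicate the request in your code", here is a minimal Python sketch using only the standard library. The endpoint URL, the `offset`/`limit` parameter names, and the JSON shape `{"images": [{"url": ...}]}` are all hypothetical; substitute whatever the network inspector actually shows for your target page.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: replace with the URL seen in the network inspector.
ENDPOINT = "https://example.com/api/images"


def build_request(offset, limit=20):
    """Build a GET request resembling the one the page sends to load more items.

    The parameter names "offset" and "limit" are assumptions for illustration.
    """
    query = urllib.parse.urlencode({"offset": offset, "limit": limit})
    return urllib.request.Request(
        f"{ENDPOINT}?{query}",
        headers={
            # Some sites inspect these headers on AJAX endpoints; copy the
            # ones your browser actually sends.
            "X-Requested-With": "XMLHttpRequest",
            "User-Agent": "Mozilla/5.0",
        },
    )


def parse_image_urls(body):
    """Extract image URLs from a JSON body of the assumed shape."""
    return [item["url"] for item in json.loads(body).get("images", [])]


def fetch_image_urls(offset, limit=20):
    """Send the request and return the image URLs from one batch."""
    with urllib.request.urlopen(build_request(offset, limit), timeout=10) as resp:
        return parse_image_urls(resp.read().decode("utf-8"))
```

Calling `fetch_image_urls(20)` would then request the second batch of 20 items, assuming the site really pages by offset. If the response is HTML rather than JSON, only `parse_image_urls` needs to change.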

Upvotes: 2

Jedi.za

Reputation: 120

This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".

Google even has a guide on what you can do to help it crawl your AJAX sites better.

The best thing would be to read the source of some open source crawlers and see what they do. But your chances of crawling even 80% of the content are slim at best, unless you have a specific target in mind.

There are also some interesting reads at Crawljax.

Basically, you should look for scripts and check whether they make any AJAX calls, then determine what parameters they take and repeat the calls with incremented or decremented parameter values. This only works if the parameters follow a logical pattern, such as numbers or single letters. It also depends on whether you're targeting a known site or just sending the crawler into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
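The "repeat calls with incremented parameter values" idea can be sketched roughly like this in Python. The function names, the offset-based paging, and the "an empty batch means we're done" stopping rule are assumptions for illustration; a real site may page by cursor or token instead, in which case this pattern does not apply directly.

```python
def scrape_all(fetch_page, start=0, step=20, max_pages=1000):
    """Call fetch_page with increasing offsets until it returns nothing.

    fetch_page is any callable that takes an offset and returns a list of
    items (for example, a function that performs the AJAX request).
    max_pages is a safety cap so a misbehaving endpoint cannot loop forever.
    """
    items = []
    offset = start
    for _ in range(max_pages):
        page = fetch_page(offset)
        if not page:  # an empty batch is taken to mean there is no more data
            break
        items.extend(page)
        offset += step
    return items


# Stand-in for a real AJAX endpoint: 45 fake image names served 20 at a time.
FAKE_IMAGES = [f"img{i}.jpg" for i in range(45)]


def fake_fetch_page(offset):
    return FAKE_IMAGES[offset:offset + 20]


all_images = scrape_all(fake_fetch_page)
```

Here `all_images` ends up holding all 45 fake names after four calls (offsets 0, 20, 40, and 60, the last of which comes back empty). For a real target you would pass a function that makes the actual HTTP request in place of `fake_fetch_page`.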

Good luck

Upvotes: 2
