Anagmate

Reputation: 1893

File get contents params

I'm writing a PHP crawler for an e-shop called alza.cz, and I want links to all products in the shop. The category page at http://www.alza.cz/notebooky/18842920.htm displays only the first 21 items. To get all items I have to go to http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000.

The crawler uses file_get_contents to fetch the page's HTML, which is then parsed using DOM. The problem is that file_get_contents seems to ignore the part after the # (it returns only the first 21 items instead of all of them). Any ideas?

Upvotes: 1

Views: 151

Answers (1)

Paul Dixon

Reputation: 300825

file_get_contents ignores the #xxxxx part of the URL (the fragment identifier) and does not include it in the request it sends. The fragment is something a user agent uses on the client side. Most likely the website has some JavaScript that uses AJAX to load a new page of results.
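To make the fragment behaviour concrete, here is a minimal sketch using the URL from the question. The helper name `urlWithoutFragment` is made up for illustration; it only shows which part of the URL can reach the server, it does not fetch anything.

```php
<?php
// Hypothetical helper: returns the URL without its fragment,
// which is the only part an HTTP request can carry to the server.
function urlWithoutFragment(string $url): string
{
    $pos = strpos($url, '#');
    return $pos === false ? $url : substr($url, 0, $pos);
}

$url = 'http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000';

// parse_url() exposes the fragment as a separate component.
$parts = parse_url($url);
echo $parts['fragment'], "\n";        // the part that stays on the client side
echo urlWithoutFragment($url), "\n";  // the part the server actually sees
```

Both URLs in the question therefore produce the same request, which is why the response contains the same 21 items.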

You could see if the page obeys the Google AJAX Crawling Specification, though based on your example, it doesn't look like it. If you see "hash bang" fragment identifiers like #!foo=bar, that's a good sign.
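For a site that does follow that scheme, a crawler rewrites the "hash bang" URL into one with an `_escaped_fragment_` query parameter before requesting it. The sketch below shows that mapping as I understand the specification; the function name `escapedFragmentUrl` and the example.com URL are placeholders, and the set of escaped characters should be checked against the specification before relying on it.

```php
<?php
// Hypothetical helper: maps a "#!" URL to its _escaped_fragment_ form.
// Returns null when the URL has no hash-bang fragment.
function escapedFragmentUrl(string $url): ?string
{
    $pos = strpos($url, '#!');
    if ($pos === false) {
        return null;
    }

    $base = substr($url, 0, $pos);
    $fragment = substr($url, $pos + 2);

    // Percent-encode control characters and space, '#', '%', '&', '+',
    // and bytes 0x7F-0xFF (assumed from the scheme's escaping rules).
    $escaped = preg_replace_callback(
        '/[\x00-\x20#%&+\x7F-\xFF]/',
        function (array $m): string {
            return sprintf('%%%02X', ord($m[0]));
        },
        $fragment
    );

    // Append to an existing query string if there is one.
    $separator = strpos($base, '?') === false ? '?' : '&';

    return $base . $separator . '_escaped_fragment_=' . $escaped;
}

var_dump(escapedFragmentUrl('http://example.com/page#!foo=bar'));
var_dump(escapedFragmentUrl('http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000'));
```

The second call returns null because the URL from the question uses a plain `#` rather than `#!`, which matches the observation that this site does not appear to follow the scheme.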

So, you'll need to observe the AJAX requests in Firebug or similar and replicate the same requests yourself.
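Once the request is known, it can be replayed with file_get_contents and a stream context. The following is only a sketch: the endpoint URL, the `page` parameter, the POST method, and the headers are all placeholders, since the real values have to be copied from what the browser's network panel shows for this site.

```php
<?php
// Hypothetical helper: builds stream context options for a form-encoded
// POST that mimics a browser AJAX call.
function ajaxContextOptions(array $params): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n"
                       . "X-Requested-With: XMLHttpRequest\r\n",
            'content' => http_build_query($params),
        ],
    ];
}

// Placeholder values: replace with the endpoint and parameters
// observed in the browser for the real site.
$endpoint = 'http://www.example.com/hypothetical/ajax-endpoint';
$options  = ajaxContextOptions(['page' => 2]);

// Uncomment once $endpoint and the parameters match the observed request:
// $html = file_get_contents($endpoint, false, stream_context_create($options));
```

If the observed request turns out to be a GET, the parameters go into the query string instead and no context is needed.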

Upvotes: 1
