Reputation: 1893
I'm writing a PHP crawler for an e-shop called alza.cz, and I want links to all products in that shop. I'm at http://www.alza.cz/notebooky/18842920.htm, but this page displays only the first 21 items. To get all items I have to go to http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000.
The crawler uses file_get_contents to get the HTML of the page, which is then parsed using DOM. The problem is that file_get_contents seems to ignore the part after # (it returns only the first 21 items instead of all of them). Any ideas?
Upvotes: 1
Views: 151
Reputation: 300825
file_get_contents ignores the #xxxxx part of the URL (the fragment identifier) and does not include it in the request. The fragment is something a user agent uses on the client side; most likely, the website has some JavaScript that reads it and uses AJAX to load a new page of results.
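To see which part of the URL is affected, here is a small sketch that splits the address from the question with parse_url. It only illustrates the split between the request URL and the fragment; it makes no request to the site.

```php
<?php
// The URL from the question; everything after '#' is the fragment.
$url = 'http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000';

$parts = parse_url($url);

// What an HTTP client actually asks the server for: scheme, host and path
// (plus a query string, if there were one). The fragment is not part of it.
$requested = $parts['scheme'] . '://' . $parts['host'] . $parts['path'];

echo $requested, "\n";          // http://www.alza.cz/notebooky/18842920.htm
echo $parts['fragment'], "\n";  // f&pg=1/10000
```

So the server sees the same request with or without the #f&pg=1/10000 suffix, which is why the response is identical.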
You could check whether the page obeys the Google AJAX Crawling Specification, though based on your example it doesn't look like it. If you see "hash bang" fragment identifiers like #!foo=bar, that's a good sign.
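For sites that do use hash-bang URLs, that specification maps #!foo=bar to a query parameter named _escaped_fragment_ that a crawler can request directly. A minimal sketch of the mapping follows; the function name escapedFragmentUrl and the example.com URL are made up for illustration, and since Google has deprecated the scheme, a given site may no longer answer such requests.

```php
<?php
// Hypothetical helper: turn a "#!" URL into its _escaped_fragment_ form.
// Returns null when the URL has no hash-bang, i.e. the scheme does not apply.
function escapedFragmentUrl(string $url): ?string
{
    $pos = strpos($url, '#!');
    if ($pos === false) {
        return null;
    }

    $base     = substr($url, 0, $pos);
    $fragment = substr($url, $pos + 2);

    // Append to an existing query string with '&', otherwise start one with '?'.
    $separator = (strpos($base, '?') === false) ? '?' : '&';

    // Percent-encode only the characters the scheme reserves:
    // control characters and space, '#', '%', '&', '+', and bytes >= 0x7F.
    $escaped = preg_replace_callback(
        '/[\x00-\x20#%&+\x7F-\xFF]/',
        function (array $m): string {
            return sprintf('%%%02X', ord($m[0]));
        },
        $fragment
    );

    return $base . $separator . '_escaped_fragment_=' . $escaped;
}

echo escapedFragmentUrl('http://www.example.com/list#!foo=bar'), "\n";
// http://www.example.com/list?_escaped_fragment_=foo=bar

var_dump(escapedFragmentUrl('http://www.alza.cz/notebooky/18842920.htm#f&pg=1/10000'));
// NULL, because the fragment starts with "#f", not "#!"
```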
So, you'll need to observe the AJAX requests in Firebug or similar and replicate the same requests yourself.
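As a rough sketch of what replicating such a request could look like with file_get_contents and a stream context: the endpoint URL and the parameter names category and page below are placeholders, not the site's real API. Take the actual URL, method, parameters, and headers from the request you observe in the browser.

```php
<?php
// Placeholder endpoint: replace it with the URL seen in the network panel.
const AJAX_ENDPOINT = 'http://www.example.com/ajax/products';

// Build the request URL from the endpoint and its query parameters.
function buildAjaxUrl(string $endpoint, array $params): string
{
    return $endpoint . '?' . http_build_query($params);
}

// Fetch a URL the way the page's JavaScript might, by sending the header
// many AJAX libraries add. Returns the response body, or false on failure.
function fetchAjax(string $url)
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "X-Requested-With: XMLHttpRequest\r\n"
                       . "Accept: application/json, text/html\r\n",
            'timeout' => 10,
        ],
    ]);

    return file_get_contents($url, false, $context);
}

// Parameter names are assumptions made for illustration only.
$url = buildAjaxUrl(AJAX_ENDPOINT, ['category' => 18842920, 'page' => 1]);
echo $url, "\n";
// http://www.example.com/ajax/products?category=18842920&page=1

// $body = fetchAjax($url);  // uncomment once the real endpoint is filled in
```

If the observed request is a POST, set 'method' to 'POST' and pass the encoded parameters in the context's 'content' option instead of the query string. The response may be JSON or an HTML fragment, so parse it accordingly.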
Upvotes: 1