Reputation: 619
I am designing a web crawler to fetch a list of products from a site. I have tried Simple HTML DOM Parser and file_get_contents() to fetch and parse the HTML, but fetching the content takes too long, and because the page is huge there is also a lot of parsing overhead. I am looking for a way, if possible, to fetch only the required portion of the HTML to speed things up, for example by using the offset and maxlen parameters of file_get_contents(), but seeking (offset) is not supported with remote files:
$html = file_get_contents($url, false, null, 9000, 5000); // offset 9000, maxlen 5000
Is there any other way to do this?
Upvotes: 1
Views: 358
Reputation: 522500
It is possible to do this at the HTTP protocol level using a Range header in the request, but there is no guarantee that the other server understands or honors it. Further, do you really know the exact byte offset of the content that interests you? That sounds very brittle. Also, if you only fetch a partial HTML document, you may have a hard time parsing it.
Look at the $context parameter of file_get_contents() and the related documentation about contexts for setting HTTP headers, and try the Range header.
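A minimal sketch of that approach, assuming the target server supports byte ranges (the URL and byte offsets below are placeholders, not values taken from your page):

    <?php
    // Request only a byte range of a remote page via a stream context.
    // The server must honor the Range header; otherwise it returns the
    // whole document with a 200 status instead of 206 Partial Content.

    $url = 'http://example.com/products.html'; // placeholder URL

    $context = stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Range: bytes=9000-13999\r\n", // roughly offset 9000, length 5000
        ],
    ]);

    $partialHtml = file_get_contents($url, false, $context);

    if ($partialHtml !== false) {
        // $http_response_header is populated by file_get_contents().
        echo $http_response_header[0], "\n";      // e.g. "HTTP/1.1 206 Partial Content"
        echo strlen($partialHtml), " bytes fetched\n";
    }

Check the status line in $http_response_header: a 206 Partial Content response means the range was honored, while a plain 200 means you received the whole page anyway.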
Upvotes: 1