Yogesh R.L

Reputation: 619

Web crawler: fetch only the useful HTML content to speed up fetching in PHP

I am designing a web crawler to fetch a list of products from a site. I have tried Simple HTML DOM Parser and file_get_contents() to fetch the HTML and parse it, but fetching the HTML takes too long, and there is also a lot of parsing overhead because the page is huge. I am looking for a way to fetch only the required part of the HTML to speed things up, for example by using the offset and maxlen parameters of file_get_contents(). However, seeking (offset) is not supported with remote files.

 file_get_contents($filename, false, null, 9000, 5000); // offset = 9000, maxlen = 5000

Is there any other way to do this?

Upvotes: 1

Views: 358

Answers (1)

deceze

Reputation: 522500

This is possible at the HTTP protocol level by sending a Range header in the request. However, there is no guarantee that the remote server understands or honors it. Further, do you really know the exact byte offset of the content that interests you? That sounds really brittle. Also, if you only fetch part of an HTML document, you may have a hard time parsing it.

Look at the $context parameter of file_get_contents() and the related documentation on stream contexts for setting HTTP headers, and try the Range header.
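A minimal sketch of that approach, assuming you already know the byte offset and length you want. The helper names rangeContextOptions() and fetchRange() are hypothetical and only for illustration, and the 10-second timeout is an arbitrary choice. Since the server may ignore the Range header, the sketch checks for a 206 Partial Content status and otherwise cuts the slice out of the full body locally (which saves parsing work but not download time).

```php
<?php
// Hypothetical helper: build stream-context options that request a byte range.
function rangeContextOptions(int $offset, int $length): array
{
    // The end position of an HTTP byte range is inclusive.
    $last = $offset + $length - 1;

    return [
        'http' => [
            'method'  => 'GET',
            'header'  => "Range: bytes={$offset}-{$last}\r\n",
            'timeout' => 10, // seconds; arbitrary value for this sketch
        ],
    ];
}

// Hypothetical helper: fetch a slice of a remote document.
// Returns the slice as a string, or false if the request failed.
function fetchRange(string $url, int $offset, int $length)
{
    $context = stream_context_create(rangeContextOptions($offset, $length));
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        return false;
    }

    // The HTTP wrapper fills $http_response_header in the local scope.
    // Keep the last status line so that redirects are handled.
    $status = '';
    foreach ($http_response_header as $line) {
        if (strncmp($line, 'HTTP/', 5) === 0) {
            $status = $line;
        }
    }

    // 206 means the server honored the range and $body is already the slice.
    if (preg_match('#^HTTP/\S+\s+206\b#', $status)) {
        return $body;
    }

    // Otherwise the server sent the whole document; cut the slice locally.
    return substr($body, $offset, $length);
}
```

Calling fetchRange($url, 9000, 5000) would then request bytes 9000 through 13999 of the page. The returned fragment will usually start and end in the middle of the markup, so expect to clean it up before handing it to a parser.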

Upvotes: 1
