user1175307

Reputation: 161

Crawl page faster [PHP]

I have a small question about crawling web pages in PHP. I have to crawl about 90,000 products on one big e-shop. I tried it in PHP, but one product takes about 2-3 seconds, which is too slow. Any tips on how to do it faster? Maybe a multithreaded C++ version? But what about the time of an HTTP request? I mean, is this a limitation of PHP or not? Thank you for the tips.

Upvotes: 1

Views: 2479

Answers (4)

Justin T.

Reputation: 3701

There is a 99% probability that PHP is NOT the problem. It is more likely the e-shop's web server or some other source of network latency.

I know this for sure because I have been doing this for months now, and even if your code has lots of regular expressions, data scraping is really fast in PHP.

The solution to speed this up? Pre-cache the whole website with a command-line crawler, since disk space is cheap. curl can do this, and so can HTTrack. It will be much faster and more stable than having PHP do the crawling.

Then let PHP do the parsing alone; hopefully you will see PHP chomping through dozens of pages per minute. Hope this helps :)
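
To illustrate the "parse from disk" half of this idea, here is a minimal sketch. It assumes the crawler has already saved one `.html` file per product into a single directory; the function names, the directory layout, and the XPath selectors (`//h1`, `class="price"`) are all hypothetical and would need to match the shop's real markup.

```php
<?php
// Sketch: parse product pages that a command-line crawler already saved to disk.
// Assumption: one .html file per product; selectors below are placeholders.

function parse_product(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string(...) returns the text of the first matching node, or '' if none.
    $name  = $xpath->evaluate('string(//h1)');
    $price = $xpath->evaluate('string(//*[@class="price"])');

    return ['name' => trim($name), 'price' => trim($price)];
}

function parse_cached_dir(string $dir): array
{
    $products = [];
    // glob() can return false on error, so fall back to an empty list.
    foreach (glob($dir . '/*.html') ?: [] as $file) {
        $products[basename($file)] = parse_product(file_get_contents($file));
    }
    return $products;
}
```

Because no network is involved at this stage, the parsing step can be rerun as often as needed while the selectors are being tuned.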

Upvotes: 0

Spudley

Reputation: 168803

If your program is running slowly, my advice would be to run a profiler on it, and analyse why it's running slowly.

This advice applies to any language, but in the case of PHP, the profiler software you need is called Xdebug.

This is a PHP extension, so you need to install it on your server. If you're running on an ISP's server, then you may not have permission to do this, but you can always install it with PHP on your local PC and run your tests there.

Once you've got Xdebug installed, switch on the profiling features in php.ini (see the Xdebug documentation for instructions on this), and run your program. It will then generate profiler files, which can be used to analyse what the program is doing.

Download KCacheGrind to perform the analysis. It reads the profiler files and displays call tree information, showing exactly what happened as the program ran, and how long every function call took.

With this information, you can look for the function calls that are running slowly, and work out what's happening. Usually the reason for slow code is some kind of inefficiency in the way something is written; Xdebug will help you find it.

Hope that helps.

Upvotes: 0

VoteyDisciple

Reputation: 37813

That's an extremely vague question. When you benchmarked the code you have, what was the slowest part? Was it network transfer times? Using a different language (or multiple threads) won't change that.
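
A rough way to answer that question without a full profiler is to time each stage separately. The helper below is only a sketch: `time_steps` is a made-up name, and the step labels are whatever you choose. It needs PHP 7.4+ only if you use arrow functions for the steps.

```php
<?php
// Sketch: run named steps in order and record how long each one took.
// Each step receives the previous step's return value (null for the first).

function time_steps(array $steps): array
{
    $timings = [];
    $value = null;
    foreach ($steps as $label => $step) {
        $start = microtime(true);
        $value = $step($value);
        $timings[$label] = microtime(true) - $start; // seconds as float
    }
    return $timings;
}

// Possible usage (the URL is a placeholder, not a real shop):
// $timings = time_steps([
//     'fetch' => function ($unused) { return file_get_contents('https://shop.example/product/1'); },
//     'parse' => function ($html) { return preg_match('~<h1>(.*?)</h1>~s', $html, $m) ? $m[1] : null; },
// ]);
// print_r($timings);
```

If `fetch` accounts for nearly all of the 2-3 seconds, the bottleneck is the network or the remote server, and rewriting the parser in another language will not help.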

Was it time spent parsing the page? How are you doing that? If you're using an XML library to parse the entire DOM, could you get away with just looking for keywords (or even regular expressions)? That's less precise (and in some sense less correct) but perhaps it's faster.
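
As an example of the regular-expression shortcut, the sketch below pulls a price out of a page without building a DOM. The markup it expects (`class="price"` followed by a number) is an assumption, and `extract_price` is a hypothetical helper; a pattern like this breaks easily if the shop changes its HTML.

```php
<?php
// Sketch: extract a price with a regular expression instead of a DOM parser.
// Assumed markup: <span class="price">19.90</span> (placeholder, adjust as needed).

function extract_price(string $html): ?string
{
    // Match the first number that follows an element with class="price".
    $pattern = '~class="price"[^>]*>\s*([0-9]+(?:[.,][0-9]+)?)~';
    if (preg_match($pattern, $html, $match)) {
        return $match[1];
    }
    return null; // no price found
}
```

Whether this is actually faster than the DOM approach for your pages is something to measure rather than assume.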

What algorithms are you using for your analysis? Would other data structures provide better performance? As one simple example, if you spend a lot of time iterating over an array, perhaps a hash map is more appropriate.
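
A concrete case of that in a crawler is tracking which URLs were already visited. The sketch below uses PHP array keys as a hash map, so each check is a key lookup instead of an `in_array()` scan over the whole list. `mark_seen` is a made-up helper name.

```php
<?php
// Sketch: remember visited URLs using array keys (a hash map) rather than
// scanning a list with in_array(), which gets slower as the list grows.

function mark_seen(array &$seen, string $url): bool
{
    if (isset($seen[$url])) {
        return false; // already visited, skip it
    }
    $seen[$url] = true;
    return true;      // first time we see this URL
}
```

With around 90,000 product URLs, the difference between a linear scan and a key lookup per URL can become noticeable, though it is unlikely to matter next to network time.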

PHP can be run in multiple processes. What happens if you kick off multiple instances of your script at once (on different pages)? Does the total time decrease?
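
One simple way to try this is to give every instance the same URL list plus its own worker number, and let each instance take only its share. This is a sketch under that assumption; `urls_for_worker` and the script name in the usage note are hypothetical.

```php
<?php
// Sketch: split one URL list across several independent PHP processes.
// Worker numbers run from 0 to $workers - 1; each URL goes to exactly one worker.

function urls_for_worker(array $urls, int $worker, int $workers): array
{
    $mine = [];
    foreach (array_values($urls) as $i => $url) {
        if ($i % $workers === $worker) {
            $mine[] = $url;
        }
    }
    return $mine;
}
```

Each instance could then be started from the shell with something like `php crawl.php 0 4`, `php crawl.php 1 4`, and so on, reading the two numbers from `$argv`. Timing the whole run with 1, 2, and 4 instances shows whether the total time really decreases.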

Ultimately you've described a very general problem so I can't offer very specific solutions, but there is no inherent reason why PHP is inappropriate for this task. When you've identified what's slow (regardless of what language you're using) you should be able to more precisely address how to fix it.

Upvotes: 2

I don't think it's PHP's problem, but it could be, depending on connection speed/computer speed. I've never had a speed problem with PHP/cURL though.

Just do multiple threads (i.e. multiple connections at once). I suggest you use cURL, but that's only because I'm familiar with it.

Here's a guide I've used for multiple threads for scraping with cURL: http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

Be VERY careful not to accidentally cause a denial-of-service situation with your scripts. But I'm sure you're already aware of that possibility.
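
For reference, a minimal sketch of the idea using PHP's built-in `curl_multi_*` functions (not the class from the linked guide). Strictly speaking these are concurrent connections in one process rather than threads. The function names, the batch size of 5, and the one-second pause are assumptions chosen to keep the load on the shop modest; pick values the site owner would be comfortable with.

```php
<?php
// Sketch: fetch a small batch of URLs concurrently with curl_multi,
// then pause between batches so the target server is not flooded.

function fetch_batch(array $urls, int $timeoutSeconds = 30): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach (array_unique($urls) as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    $running = 0;
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(10000); // select could not wait; avoid a busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no response
            'body'   => curl_multi_getcontent($ch),
        ];
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $results;
}

function crawl_politely(array $urls, int $batchSize = 5, float $pauseSeconds = 1.0): array
{
    $results = [];
    foreach (array_chunk(array_unique($urls), $batchSize) as $batch) {
        $results += fetch_batch($batch);
        usleep((int) ($pauseSeconds * 1000000)); // breathing room for the server
    }
    return $results;
}
```

A real crawler would also check the `status` values and back off when the server starts returning errors or slowing down.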

Upvotes: 1
