Reputation: 43
I am developing a web crawler that uses a priority queue. I start with a seed URL and parse all of its links, scoring each one with my algorithm. I then put every parsed URL into the priority queue with the score it received. The crawler selects the next seed URL from the priority queue according to the link scores; when a link is selected as the new seed, it is dequeued from the priority queue, and so on. The program runs without any problems, but I have two issues:
Since enqueues outnumber dequeues, the size of the priority queue keeps growing over time. How can I control this? And will the size of the priority queue affect the performance of my crawler?
When I measure the number of crawled URLs per minute, the rate decreases over time. For example, after running the program for 1 hour, the average number of crawled pages is lower than the average I get after running it for only 15 minutes. Is this happening because of the priority queue's size? How can I fix it?
These are my two issues, and I would appreciate your help and ideas for solving them in my crawling algorithm.
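To make this concrete, here is a minimal sketch of the loop I am describing (Java is only for illustration; the parsing and scoring functions are placeholders for my real code):

    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class CrawlLoop {
        record ScoredUrl(String url, double score) {}

        public static void main(String[] args) {
            // Highest-scoring link comes out first.
            PriorityQueue<ScoredUrl> frontier = new PriorityQueue<>(
                    Comparator.comparingDouble(ScoredUrl::score).reversed());
            frontier.offer(new ScoredUrl("http://example.com", 1.0)); // seed URL

            while (!frontier.isEmpty()) {
                ScoredUrl seed = frontier.poll();            // dequeue the best link
                for (String link : parseLinks(seed.url())) { // every outbound link...
                    frontier.offer(new ScoredUrl(link, score(link))); // ...is enqueued
                }
                // Pages usually yield several links, so enqueues outnumber
                // dequeues and the queue keeps growing.
            }
        }

        // Placeholders for the real fetch/parse and scoring code.
        static List<String> parseLinks(String url) { return List.of(); }
        static double score(String url) { return 0.0; }
    }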
Upvotes: 1
Views: 134
Reputation: 5594
As rafalio noted, the size of your queue will increase continually, because the average web page contains more than one outbound link and you will never reach closure over the whole internet :-)
But rather than following only the top N links from each page, as rafalio suggests, I would instead recommend that you set a global cap on the size of the priority queue (say, 30,000). That way, as the queue grows, each new link you traverse (the next-highest-priority link in the queue) will have accumulated greater and greater priority relative to everything else encountered so far. By the time the queue reaches the cap, you can be almost certain that its highest-priority item is more important than the top N links on the current page (or on any arbitrary page you could visit next).
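Here is a minimal sketch of one way to enforce such a cap (assuming Java; the class and method names are made up for the example). A tree-based ordered set lets you dequeue the best link and evict the worst one, both in O(log n):

    import java.util.Comparator;
    import java.util.TreeSet;

    class BoundedFrontier {
        record ScoredUrl(String url, double score) {}

        private static final int MAX_SIZE = 30_000; // the global cap

        // Highest score first; ties broken by URL so distinct links never collide.
        private final TreeSet<ScoredUrl> queue = new TreeSet<>(
                Comparator.comparingDouble(ScoredUrl::score).reversed()
                          .thenComparing(ScoredUrl::url));

        void enqueue(String url, double score) {
            queue.add(new ScoredUrl(url, score));
            if (queue.size() > MAX_SIZE) {
                queue.pollLast(); // evict the lowest-scoring link
            }
        }

        ScoredUrl dequeue() {
            return queue.pollFirst(); // next seed = current best link
        }
    }

Every enqueue that pushes the size over the cap simply discards the least promising link, so memory stays bounded while the average quality of the frontier only goes up.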
Keep in mind that if your priority queue is array-backed, one of insertion or removal will be O(n) (an unsorted array has O(n) removal; a sorted array has O(n) insertion). A sorted linked list also has O(n) insertion. A binary-heap-backed priority queue should give the best overall performance for large n: no worse than O(log n) for both insert and remove, and O(1) lookup of the top item.
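For reference, Java's built-in java.util.PriorityQueue is such a binary-heap-backed queue (this snippet is illustrative, not taken from the question):

    import java.util.Comparator;
    import java.util.PriorityQueue;

    public class HeapDemo {
        record ScoredUrl(String url, double score) {}

        public static void main(String[] args) {
            // offer() and poll() are O(log n); peek() is O(1). Reverse the
            // comparator so the highest-scoring link comes out first.
            PriorityQueue<ScoredUrl> frontier = new PriorityQueue<>(
                    Comparator.comparingDouble(ScoredUrl::score).reversed());

            frontier.offer(new ScoredUrl("http://example.com/a", 0.9));
            frontier.offer(new ScoredUrl("http://example.com/b", 0.4));

            System.out.println(frontier.poll()); // the 0.9 link comes out first
        }
    }

One caveat: a plain binary heap only gives cheap access to one end, so evicting the lowest-priority item (for the cap described above) would be O(n); that is why the sketch above used a tree-based set instead, which is O(log n) at both ends.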
Upvotes: 2
Reputation: 3946
The size of the priority queue will obviously grow large, since pages usually contain more than one outbound link. What you are doing is essentially a search over the whole graph induced by the pages and the URLs they contain, so with your current algorithm you will end up traversing the entire graph. Consider instead taking only the top N links from each page and following only those; a sketch follows below.
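As a sketch (assuming Java; the scoring function is a stand-in for whatever your algorithm computes), selecting the top N links could look like this:

    import java.util.Comparator;
    import java.util.List;
    import java.util.function.ToDoubleFunction;

    class TopLinks {
        // Keep only the n highest-scoring links from one page.
        static List<String> topN(List<String> links, ToDoubleFunction<String> score, int n) {
            return links.stream()
                    .sorted(Comparator.comparingDouble(score).reversed())
                    .limit(n)
                    .toList();
        }

        public static void main(String[] args) {
            List<String> links = List.of("http://a.example", "http://b.example/page", "http://c.example");
            // Hypothetical scoring (URL length), purely for the demo.
            System.out.println(topN(links, String::length, 2));
        }
    }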
Regarding speed: visiting websites is parallelizable, so start with that by, say, always crawling k links at a time, as sketched below. Otherwise, are you using a good implementation of a priority queue? And are you properly closing all the network connections that your program opens?
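A minimal sketch of batch-parallel fetching with a fixed-size thread pool (assuming Java; fetch() is a placeholder for your own download-and-parse routine):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class ParallelCrawl {
        static void crawlBatch(List<String> batch, int k) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(k); // k fetches at a time
            for (String url : batch) {
                pool.submit(() -> fetch(url));
            }
            pool.shutdown();                             // no new tasks
            pool.awaitTermination(10, TimeUnit.MINUTES); // wait for the batch to finish
        }

        // Placeholder: download the page, parse it, enqueue new links, and
        // close the connection (e.g. with try-with-resources) when done.
        static void fetch(String url) {}
    }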
Upvotes: 2