Alex Coplan

Reputation: 13361

Periodic HTML crawl

I'm considering developing a site where the server will periodically crawl another site in order to gather content for certain entries in my database. My questions are as follows...

  1. How do you get the server to run a crawl on a schedule?
  2. Can it execute PHP, or what language do you use to perform the crawl?
  3. Are there any good APIs to do this?
  4. Should I consider building my own? If so, some advice on how to get started would be great

Basically, what I want is for the server to execute a script (say, every hour) which finds all entries in the database that haven't yet been crawled on another site. It will take a certain value from each of those entries and use it to crawl the other site; it might request a URL like this: www.anothersite.com/images?q=entryindb.

What I want it to do is then crawl the HTML, return an array, and log the values in the database. This is what I want the crawler to look for:

Find all instances of:
<img> inside <a> inside <td> inside <tr> inside <tbody> inside <table> inside <div id='content'>
Return an array of the img.src values from all instances.
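
Roughly, I picture the hourly script working something like this (the database, table, and column names here are made up, and parseImages() is a placeholder for exactly the part I don't know how to write):

$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Find entries that haven't been crawled yet
foreach ($db->query("SELECT id, q FROM entries WHERE crawled = 0") as $row) {
    // Use the entry's value to build the crawl URL
    $html = file_get_contents('http://www.anothersite.com/images?q=' . urlencode($row['q']));

    // parseImages() should return the array of img.src values described above
    $images = parseImages($html);

    foreach ($images as $src) {
        $db->prepare("INSERT INTO images (entry_id, src) VALUES (?, ?)")
           ->execute(array($row['id'], $src));
    }

    // Mark the entry as done so the next run skips it
    $db->prepare("UPDATE entries SET crawled = 1 WHERE id = ?")->execute(array($row['id']));
}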

Is something like that possible? If so, how would I go about doing it? Please bear in mind that, web-dev-wise, my only server-side experience so far is with PHP.

UPDATE: I will be using a Linux-based server, so I guess a cron job is the way to do it?

Upvotes: 2

Views: 1034

Answers (4)

Shef

Reputation: 45589

  1. You can use cron
  2. Yes, you can run a PHP script
  3. Nothing like a complete crawling API (AFAIK), but there are classes which will help you parse and traverse DOM documents.
  4. You can set something up in minutes if you follow these steps

1. You need phpQuery to make your life easier with this

Download phpQuery-0.9.5.386-onefile.zip from here.

2. Your PHP file would be something like this

require_once 'phpQuery-onefile.php';

// Load the remote page into phpQuery
$html = file_get_contents('http://www.othersite.com');
phpQuery::newDocumentXHTML($html);

// jQuery-style selector matching the nesting described in the question
$elements = pq('#content table tbody tr td a img');

$images = array();
foreach ($elements as $img) {
    $images[] = pq($img)->attr('src');
}

The $images array will have a list of all the image sources.
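
If you then want to log those values in your database, a minimal sketch using PDO might look like this (the connection details and the images table are just examples; adjust them to your schema):

$db = new PDO('mysql:host=localhost;dbname=yourdb', 'user', 'pass');

// Prepare once, execute per image source
$stmt = $db->prepare('INSERT INTO images (src) VALUES (?)');
foreach ($images as $src) {
    $stmt->execute(array($src));
}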

3. Save the above code in a file, say crawler.php

Then in the crontab, if you want the crawler to run every hour, you would add:

0 * * * * php /path/to/your/crawler.php 

Upvotes: 4

chelmertz

Reputation: 20601

You could fetch the HTML with cURL (screen scraping) and write the HTML parser with PHP's DOMDocument. If the HTML is messy, you cannot read it directly with DOMDocument, but you could "wash it" with, for example, HTMLPurifier, which takes invalid HTML and spits out valid markup.
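
A rough sketch of that approach (the URL is just an example; as an alternative to purifying first, you can often get away with silencing libxml's complaints about messy markup):

// Fetch the page with cURL
$ch = curl_init('http://www.anothersite.com/images?q=foo');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Keep malformed-HTML warnings from flooding the output
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// XPath version of the nesting in the question; <tbody> is skipped here
// because it is often missing from the raw source even when browsers show it
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@id='content']//table//tr//td//a//img");

$images = array();
foreach ($nodes as $img) {
    $images[] = $img->getAttribute('src');
}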

To start the process, make your PHP script able to run via the CLI (the command line, as opposed to through a web server serving a browser).
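
For example, a simple guard at the top of the script (a convention, not a requirement):

// Refuse to run when invoked through the web server
if (php_sapi_name() !== 'cli') {
    die('This script must be run from the command line.');
}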

Once you have this script, set up a cron job (if you have a Linux server) to run it at whatever interval you want.

Google the key terms above (cURL, DOMDocument, HTMLPurifier, CLI, cron job) for details.

Upvotes: 2

Christian Mann

Reputation: 8125

I would use cron for this. However, PHP might not be the best choice, unless you've already written the script. Python and BeautifulSoup might be most appropriate to scrape the URLs.

Upvotes: 0

Jonnix

Reputation: 4187

  1. You can use cron, assuming you're hosting on Linux.
  2. Yes, you can use it to run some PHP.
  3. None that I know of, but I've never looked.
  4. That's up to you. See the documentation linked below, which I feel might be useful; there's also a short example after the links.

NOTE: Check the T&Cs of the sites you want to scrape beforehand to see if they allow it.

http://php.net/file_get_contents

http://php.net/curl

http://php.net/domdocument
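
For instance, a bare-bones combination of file_get_contents and DOMDocument might look like this (the URL is a placeholder):

// Simplest possible fetch; returns false on failure
$html = file_get_contents('http://www.anothersite.com/images?q=foo');
if ($html === false) {
    die('Could not fetch the page' . PHP_EOL);
}

libxml_use_internal_errors(true); // tolerate imperfect markup
$doc = new DOMDocument();
$doc->loadHTML($html);

// Grab every <img> on the page; narrow this down as needed
foreach ($doc->getElementsByTagName('img') as $img) {
    echo $img->getAttribute('src'), PHP_EOL;
}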

Upvotes: 2
