Reputation: 13361
I'm considering developing a site where the server will crawl another site periodically, in order to gather content for certain entries in my database. My questions are as follows...
Basically, what I want is for the server to execute a script (say, every hour) which finds all entries in the database that haven't yet been crawled on another site. It will take a certain value from each of those entries and use it to crawl the other site... it might request a URL like this: www.anothersite.com/images?q=entryindb
What I want it to do is then crawl the HTML, return an array, and log the values in the database. This is what I want the crawler to look for:
Find all instances of <img> inside <a> inside <td> inside <tr> inside <tbody> inside <table> inside <div id='content'>, and return an array of the img.src from all instances.
Is something like that possible? If so, how would I go about doing it? Please bear in mind that, web-dev-wise, the only server-side experience I have so far is with PHP.
UPDATE: I will be using a Linux-based server, so I guess cron scripting is how I should do it?
Upvotes: 2
Views: 1034
Reputation: 45589
Download phpQuery-0.9.5.386-onefile.zip from here.
require_once 'phpQuery-onefile.php';

// Fetch the page and load it into phpQuery.
$html = file_get_contents('http://www.othersite.com');
phpQuery::newDocumentXHTML($html);

// Select every <img> matching the nesting described in the question.
$elements = pq('#content table tbody tr td a img');

// Collect the src attribute of each matched image.
$images = array();
foreach ($elements as $img) {
    $images[] = pq($img)->attr('src');
}
The $images array will have a list of all the image sources.
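Since the question also asks about logging the values in the database, here is a minimal sketch of how that array might be stored with PDO. The DSN, credentials, the images table, its columns, and $entryId are all assumptions for illustration; adjust them to your schema:

// Hypothetical connection details and table layout -- adjust to your setup.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$entryId = 123; // hypothetical ID of the database entry being crawled

// Insert one row per image source found by the crawler.
$stmt = $pdo->prepare('INSERT INTO images (entry_id, src) VALUES (?, ?)');
foreach ($images as $src) {
    $stmt->execute(array($entryId, $src));
}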
Save the script as crawler.php. Then in the crontab, if you want the crawler to run every hour, you would do:
0 * * * * php /path/to/your/crawler.php
Upvotes: 4
Reputation: 20601
You could fetch the HTML with cURL (screen scraping) and write the HTML parser with PHP's DOMDocument. If the HTML is messy, you cannot read it directly with DOMDocument, but you could "wash" it with, for example, HTMLPurifier, which takes invalid HTML and spits out valid markup.
To start the process, make your PHP script runnable via the CLI (the command line, as opposed to through a web server, which is what a browser talks to).
After you have this script, set up a cronjob (if you have a Linux server) to run your script at whatever interval you want.
Google the tool names above for documentation; a sketch of the approach follows.
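A minimal sketch of this approach, assuming the URL from the question as a placeholder; libxml warnings are suppressed so messy HTML doesn't abort the parse, and the XPath expression mirrors the nesting the question describes:

// Fetch the HTML with cURL.
$ch = curl_init('http://www.anothersite.com/images?q=entryindb');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse it with DOMDocument; suppress warnings from invalid markup.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Query the nesting from the question with XPath.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[@id='content']//table//tbody//tr//td//a//img");

// Collect the src attribute of each matched image.
$images = array();
foreach ($nodes as $img) {
    $images[] = $img->getAttribute('src');
}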
Upvotes: 2
Reputation: 8125
I would use cron for this. However, PHP might not be the best choice unless you've already written the script. Python with BeautifulSoup might be more appropriate for scraping the URLs.
Upvotes: 0
Reputation: 4187
NOTE: Check the T&Cs of the sites you want to scrape beforehand to see if they allow it.
http://php.net/file_get_contents
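For reference, a minimal fetch with file_get_contents looks like this; the URL is a placeholder, and the function returns false on failure, so check before parsing:

// file_get_contents can fetch a URL directly if allow_url_fopen is enabled.
$html = file_get_contents('http://www.anothersite.com/images?q=entryindb');
if ($html === false) {
    // Handle the failed request here.
}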
Upvotes: 2