Reputation: 1739
I am currently trying to crawl alot of data from a website, however I am struggling a little bit with it. It has an a-z index and 1-20 index, so it has a bunch of loops and DOM stuff in there. However, it managed to crawl and save about 10.000 rows at first run, but now I am at around 15.000 and it is only crawling around 100 per run.
It is probably because it has to skip the rows that it already has inserted, (made a check for that). I cant think of a way to easily skip some pages, as the 1-20 index varies a lot (for one letter there are 18 pages, other letter are only 2 pages).
I was checking if there already was an record with the given ID, if not, insert it. I assumed that would be slow, so now before the script stars I retrieve all rows, and then check with an in_array(), assuming thats faster. But it just wont work.
So my crawler is navigating 26 letters, 20 pages each letter, and then up to 50 times each page, so if you calculate it, its a lot.
Thought of running it letter by letter, but that wont really work as I am still stuck at "a" and cant just hop onto "b" as I will miss records from "a".
Hope I have explained the problem good enough for someone to help me. My code kinda looks like this: (I have removed some stuff here and there, guess all the important stuff is in here to give you an idea)
function in_array_r($needle, $haystack, $strict = false) {
foreach ($haystack as $item) {
if (($strict ? $item === $needle : $item == $needle) || (is_array($item) && in_array_r($needle, $item, $strict))) {
return true;
}
}
return false;
}
/* CONNECT TO DB */
mysql_connect()......
$qry = mysql_query("SELECT uid FROM tableName");
$all = array();
while ($row = mysql_fetch_array($qru)) {
$all[] = $row;
} // Retrieving all the current database rows to compare later
foreach (range("a", "z") as $key) {
for ($i = 1; $i < 20; $i++) {
$dom = new DomDocument();
$dom->loadHTMLFile("http://www.crawleddomain.com/".$i."/".$key.".htm");
$finder = new DomXPath($dom);
$classname="table-striped";
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach ($nodes as $node) {
$rows = $finder->query("//a[contains(@href, '/value')]", $node);
foreach ($rows as $row) {
$url = $row->getAttribute("href");
$dom2 = new DomDocument();
$dom2->loadHTMLFile("http://www.crawleddomain.com".$url);
$finder2 = new DomXPath($dom2);
$classname2="table-striped";
$nodes2 = $finder2->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname2 ')]");
foreach ($nodes2 as $node2) {
$rows2 = $finder2->query("//a[contains(@href, '/loremipsum')]", $node2);
foreach ($rows2 as $row2) {
$dom3 = new DomDocument();
//
// not so important variable declarations..
//
$dom3->loadHTMLFile("http://www.crawleddomain.com".$url);
$finder3 = new DomXPath($dom3);
//2 $finder3->query() right here
$query231 = mysql_query("SELECT id FROM tableName WHERE uid='$uid'");
$result = mysql_fetch_assoc($query231);
//Doing this to get category ID from another table, to insert with this row..
$id = $result['id'];
if (!in_array_r($uid, $all)) { // if not exist
mysql_query("INSERT INTO')"); // insert the whole bunch
}
}
}
}
}
}
}
Upvotes: 0
Views: 1469
Reputation: 23389
$uid
is not defined, also, this query makes no sense:
mysql_query("INSERT INTO')");
You should turn on error reporting:
ini_set('display_errors',1);
error_reporting(E_ALL);
After your queries you should do an or die(mysql_error());
Also, I might as well say it, if I don't someone else will. Don't use mysql_*
functions. They're deprecated and will be removed from future versions of PHP. Try PDO.
Upvotes: 1