Reputation: 297
The code simply fetches a page, extracts all the content from the specified table, inserts it into my database, and echoes it.
It runs very slowly. I need ideas to streamline it so that it works faster.
sets the loop
$pagenumber = 1001;
while ($pagenumber <= 5000) {
gets the content
$url = "$pagenumber";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r"," ","\0","\x0B");
$content = str_replace($newlines, '', $raw);
$start = strpos($content,'>Details<');
$end = strpos($content,'</table>',$start);
$table1 = substr($content,$start,$end-$start);
// $table1 = strip_tags($table1);
gets first name
$start = strpos($table1,'<td');
$end = strpos($table1,'<br />',$start);
$fnames = substr($table1,$start,$end-$start);
$fnames = strip_tags($fnames);
$fnames = preg_replace('/\s\s+/', '', $fnames);
gets surname
$start = strpos($table1,'<br />');
$end = strpos($table1,'</td>',$start);
$lnames = substr($table1,$start,$end-$start);
$lnames = strip_tags($lnames);
$lnames = preg_replace('/\s\s+/', '', $lnames);
gets the phone
$start = strpos($table1,'Phone:');
$end = strpos($table1,'</td> </tr> <tr>',$start);
$phone = substr($table1,$start,$end-$start);
$phone = strip_tags($phone);
$phone = str_replace("Phone:", "" ,$phone);
$phone = preg_replace('/\s\s+/', '', $phone);
gets the address
$start = strpos($table1,'Address:');
$end = strpos($table1,'</td> </tr> <tr>',$start);
$ad = substr($table1,$start,$end-$start);
$ad = strip_tags($ad);
$ad = str_replace("Address:", "" ,$ad);
$ad = preg_replace('/\s\s+/', '', $ad);
gets the apartment no
$start = strpos($table1,'Apt:');
$end = strpos($table1,'</td> </tr> <tr>',$start);
$apt = substr($table1,$start,$end-$start);
$apt = strip_tags($apt);
$apt = str_replace("Apt:", "" ,$apt);
$apt = preg_replace('/\s\s+/', '', $apt);
gets the country
$start = strpos($table1,'Country:');
$end = strpos($table1,'</td> </tr> <tr>',$start);
$country = substr($table1,$start,$end-$start);
$country = strip_tags($country);
$country = str_replace("Country:", "" ,$country);
$country = preg_replace('/\s\s+/', '', $country);
gets the city
$start = strpos($table1,'City:<br /> State/Province:');
$end = strpos($table1,'</td> </tr> <tr>',$start);
$city = substr($table1,$start,$end-$start);
$city = strip_tags($city);
$city = str_replace("City: State/Province:", "" ,$city);
$city = preg_replace('/\s\s+/', '', $city);
gets the zip
$start = strpos($table1,'Zip:');
$end = strpos($table1,'</td> </tr> <tr>',$start);
$zip = substr($table1,$start,$end-$start);
$zip = strip_tags($zip);
$zip = str_replace("Zip:", "" ,$zip);
$zip = preg_replace('/\s\s+/', '', $zip);
gets the email
$start = strpos($table1,'email:');
$end = strpos($table1,'</td> </tr>',$start);
$email = substr($table1,$start,$end-$start);
$email = strip_tags($email);
$email = str_replace("email:", "" ,$email);
$email = preg_replace('/\s\s+/', '', $email);
echoes the row
echo "<tr>
<td><a href='$pagenumber'>link</a></td>
includes db info
$tablename = 'list';
$fnames = mysql_real_escape_string($fnames);
$lnames = mysql_real_escape_string($lnames);
$phone = mysql_real_escape_string($phone);
$ad = mysql_real_escape_string($ad);
$apt = mysql_real_escape_string($apt);
$country = mysql_real_escape_string($country);
$city = mysql_real_escape_string($city);
$zip = mysql_real_escape_string($zip);
$email = mysql_real_escape_string($email);
inserts row to db
$query = "INSERT INTO $tablename VALUES('', '$pagenumber', '$fnames', '$lnames', '$phone', '$ad',
'$apt','$country','$city','$zip', '$email')";
mysql_query($query) or die(mysql_error());
resets the loop
$pagenumber = $pagenumber + 1;
}
Upvotes: 0
Views: 768
Reputation: 55002
Don't use regex to parse HTML. You should use XPath instead; in PHP specifically, that means DOMXPath.
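As a rough illustration of the DOMXPath approach, the sketch below pulls two labelled fields out of a table. The markup in `$html` and the `field()` helper are hypothetical, since the real page structure is not shown in the question; the XPath expression would need adjusting to the actual table. It assumes PHP 7.1 or later with the DOM extension.

```php
<?php
// Hypothetical markup standing in for the real page (its structure is not shown).
$html = '<html><body><table>'
      . '<tr><td>Details</td></tr>'
      . '<tr><td>Phone: 555-0100</td></tr>'
      . '<tr><td>Zip: 90210</td></tr>'
      . '</table></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: return the text after $label in the first <td> whose
// text starts with $label, or null if no such cell exists.
// $label must not contain a single quote, because it is placed inside the query.
function field(DOMXPath $xpath, string $label): ?string
{
    $nodes = $xpath->query("//td[starts-with(normalize-space(.), '$label')]");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    $text = trim($nodes->item(0)->textContent);
    return trim(substr($text, strlen($label)));
}

$phone = field($xpath, 'Phone:');
$zip   = field($xpath, 'Zip:');
```

Each field then takes one query instead of a chain of strpos, substr, strip_tags and str_replace calls, and the result does not depend on the exact whitespace between tags.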
Upvotes: 1
Reputation: 1289
You could take a look at cURL.
After grabbing the page(s), you could use a single pattern to capture all the required fields. Matching can be done with preg_match_all.
Also, is there an XML/RSS feed available for the data you are after? See if you can show more results per page on your example site; this would reduce the number of pages you need to crawl.
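A minimal sketch of the single-pattern idea, assuming each field sits in its own cell as "Label: value". The fragment in `$table` and the pattern itself are illustrative guesses, not taken from the real page, and would need tuning against the actual markup.

```php
<?php
// Hypothetical fragment; the real table's markup is not shown in the question.
$table = '<tr><td>Phone: 555-0100</td></tr>'
       . '<tr><td>Zip: 90210</td></tr>'
       . '<tr><td>email: someone@example.com</td></tr>';

// One pass over the table: capture every "Label: value" pair inside a <td>.
$pattern = '~<td[^>]*>\s*([A-Za-z/ ]+):\s*(.*?)\s*</td>~s';
preg_match_all($pattern, $table, $matches, PREG_SET_ORDER);

// Build a label => value map from the captured groups.
$fields = [];
foreach ($matches as $m) {
    $fields[$m[1]] = strip_tags($m[2]);
}
```

With `PREG_SET_ORDER`, each element of `$matches` holds one complete match, so the label and its value stay together while looping.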
Edit: as requested, a simple example.
Make sure you have cURL enabled on your server:
echo 'cURL is' . (function_exists('curl_init') ? '' : ' not') . ' enabled';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, '');
curl_setopt($ch, CURLOPT_REFERER, '');
curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($ch);
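One way this could be folded into the question's loop is to create the handle once and reuse it for every page, so that cURL can keep the connection open between requests instead of reconnecting each time. The sketch below is only an outline under that assumption: `make_handle()`, `fetch_page()` and `fetch_pages()` are hypothetical helper names, and the base URL is whatever prefix the page number gets appended to. It assumes PHP 7.1 or later with the cURL extension.

```php
<?php
// Hypothetical helper: build one handle with the options shared by every request.
function make_handle()
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_ENCODING, '');          // empty string: accept every encoding cURL supports
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);            // give up on a page after 5 seconds
    return $ch;
}

// Hypothetical helper: fetch one URL on an existing handle; null on failure.
function fetch_page($ch, string $url): ?string
{
    curl_setopt($ch, CURLOPT_URL, $url);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Yield page number => body (or null) for each number in [$first, $last],
// reusing a single handle for the whole range.
function fetch_pages(string $baseUrl, int $first, int $last): Generator
{
    $ch = make_handle();
    for ($n = $first; $n <= $last; $n++) {
        yield $n => fetch_page($ch, $baseUrl . $n);
    }
    // The handle is released when $ch goes out of scope.
}
```

The question's loop would then become a `foreach (fetch_pages($baseUrl, 1001, 5000) as $pagenumber => $raw)` over the generator, skipping any page where `$raw` is null, with the parsing and insert steps inside the body.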
Upvotes: 0