Reputation: 19
I'm working on a crawler project, my first, and I need some help. The task is to fetch data from 'http://justdial.com'. For example, I want to fetch the city name (Bangalore), category (hotels), hotel name, address and phone number.
I have written code to fetch a tag's content by its class. For example, this is how I fetched the address:
<?php
$url="http://www.justdial.com/Bangalore/hotels";
$original_file = file_get_contents($url);
$stripped_file = strip_tags($original_file, "<div>");
$newlines="'<div class=\"logoDesc\">(.*?)</div>'si";
$newlines=preg_replace('#<div(?:[^>]*)>.</div>#u','',$newlines);
preg_match_all("$newlines", $stripped_file, $matches);
//DEBUGGING
//$matches[0] now contains the complete matched blocks; ex: <div class="logoDesc">text</div>
//$matches[1] now contains only the content captured between the tags; ex: text
header("Content-type: text/plain"); //Set the content type to plain text so the print below is easy to read!
$path= ($matches);
print_r($path); //View the array to see if it worked
?>
Now the problem is that I want to separate the tags from their contents and store the contents in a database, and then export them from the database to an Excel sheet. Please help me.
Upvotes: 0
Views: 2014
Reputation: 19879
You should not be using regex to parse HTML; real-world markup is too irregular for patterns to handle reliably. Use a DOM parser such as DOMDocument instead. A small example of it in use:
<?php
$str = '<h1>T1</h1>Lorem ipsum.<h1>T2</h1>The quick red fox...<h1>T3</h1>... jumps over the lazy brown FROG';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
// Get all <h1> elements
$items = $DOM->getElementsByTagName('h1');
// Display the text of each <h1>
foreach ($items as $item) {
    echo $item->nodeValue . "<br/>";
}
?>
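Applied to the markup in the question, the same idea might look roughly like the sketch below. It uses DOMXPath to pick out every div with the class logoDesc, then turns the results into CSV text, which Excel can open. The sample HTML, the function names extractDivText and rowsToCsv, and the "address" column header are made up for illustration. The real justdial.com markup may well be different, so check the page source before relying on the class name.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
// In practice this would come from file_get_contents($url).
$html = '<div class="listing"><div class="logoDesc">12 MG Road, Bangalore</div></div>'
      . '<div class="listing"><div class="logoDesc">45 Brigade Road, Bangalore</div></div>';

/**
 * Return the trimmed text of every <div> whose class list contains $class.
 * (Illustrative helper, not part of any library.)
 */
function extractDivText(string $html, string $class): array
{
    $dom = new DOMDocument();
    // Real pages are rarely valid HTML; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Match $class as a whole word inside a space-separated class attribute.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        // textContent holds the text only, with all tags already removed.
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

/**
 * Build CSV text from a header row and data rows using an in-memory stream.
 * (Illustrative helper, not part of any library.)
 */
function rowsToCsv(array $header, array $rows): string
{
    $handle = fopen('php://temp', 'r+');
    fputcsv($handle, $header, ',', '"', '\\');
    foreach ($rows as $row) {
        fputcsv($handle, $row, ',', '"', '\\');
    }
    rewind($handle);
    $csv = stream_get_contents($handle);
    fclose($handle);
    return $csv;
}

$addresses = extractDivText($html, 'logoDesc');

// One CSV row per address; more columns (name, phone) could be added the same way.
$rows = array_map(function ($address) {
    return [$address];
}, $addresses);
$csv = rowsToCsv(['address'], $rows);

echo $csv;
```

To save the result, write $csv to a file with file_put_contents(). If the data has to go through a database first, the same $addresses array can be inserted with PDO prepared statements and exported later.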
Upvotes: 1