Reputation: 393
I am currently building a scraper to scrape certain information from a website.
For example, I would like to get a restaurant name, address, opening hours & telephone number from a website.
By using curl, I managed to get the data from the website:
$url = "http://localhost/test.html";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
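// Return the response body as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// $data holds the page HTML, or false if the request failed
$data = curl_exec($ch);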
curl_close($ch);
However, I need some ideas on how to point my scraper at the exact location of each piece of information so that I can extract it.
I have tried regular expressions, but was unable to get them to work.
Upvotes: 1
Views: 777
Reputation: 3998
Use the Simple HTML DOM parser for PHP:
http://simplehtmldom.sourceforge.net/
Download here:
http://sourceforge.net/projects/simplehtmldom/files/
Documentation here:
http://simplehtmldom.sourceforge.net/manual.htm
In my experience, it is the best tool for parsing HTML with PHP.
Also, you don't need curl to fetch the content unless you have a specific reason to use it. With the Simple HTML DOM parser, just use:
$remote_html = file_get_html("http://www.somesite.com/");
Upvotes: 3
Reputation: 3446
Take a look at XPath querying: http://php.net/manual/en/domxpath.query.php
I use the equivalent method for website scraping in C#. The same XPath standard applies here, and it works very well.
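As a rough illustration of that approach in PHP, here is a minimal sketch using the built-in DOMDocument and DOMXPath classes. The HTML snippet and the class names in it (restaurant, name, address, hours, phone) are made up for the example, since the question does not show the real page; in practice $data would be the string returned by curl_exec(), and the XPath expressions would need to match the actual markup.

```php
<?php
// Hypothetical markup standing in for the fetched page. The real element
// and class names will differ and must be read from the target site's HTML.
$data = <<<HTML
<html><body>
  <div class="restaurant">
    <h1 class="name">Example Bistro</h1>
    <p class="address">1 Sample Street</p>
    <p class="hours">Mon-Fri 10:00-22:00</p>
    <p class="phone">555-0100</p>
  </div>
</body></html>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parse warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// One XPath expression per field (all of these selectors are assumptions).
$queries = [
    'name'    => '//h1[@class="name"]',
    'address' => '//p[@class="address"]',
    'hours'   => '//p[@class="hours"]',
    'phone'   => '//p[@class="phone"]',
];

$restaurant = [];
foreach ($queries as $field => $expression) {
    $nodes = $xpath->query($expression);
    // query() returns a DOMNodeList; use the first match, or null if nothing matched.
    $restaurant[$field] = ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : null;
}

print_r($restaurant);
```

Keeping the expressions in one array means that when the site's layout changes, only the selectors need updating, not the extraction loop.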
Upvotes: 1