Reputation: 376
I am trying to get the contents of the node in the webpage I am parsing. Here is my code:
include('simplehtmldom_1_5/simple_html_dom.php');
// get DOM from URL or file
$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";
$html = file_get_html($feedUrl);
$xpath = "/html/body/div[5]/div[1]/div[1]/div[1]/div[5]/div[3]/div[1]/div[1]/div[1]/div[1]/a[1]/div[1]/div[1]/div[3]/div[1]/div[2]/h3[1]/div[1]/a[1]";
foreach($html->find($xpath) as $e)
    echo $e->title . '<br>';
In this example, I am trying to get the name "Tax Experience CPA, Inc" from the webpage. The issue is that the array returned by find($xpath) is always empty. When I open Google Chrome and search for the node with that XPath, I find exactly the node I want, but the same path does not work in my code. There must be an issue with the path I am using, but I can't figure out what it is. I have searched and searched but haven't been able to find what I am doing wrong. Please help.
Upvotes: 0
Views: 1827
Reputation: 4953
The website has a lot of nodes with ids and classes. Use them to build a shorter, simpler XPath expression that retrieves what you want.
Here's working code for you:
// includes Simple HTML DOM Parser
include "simple_html_dom.php";
$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL or file
$html->load_file($feedUrl);
// Find all anchors
$anchors = $html->find("//div[@class='srp-business-name']/a");
// Display all titles
foreach($anchors as $a)
    echo $a->title . '<br>';
OUTPUT
Tax Experience CPA Inc
Bernice Hassan CPA Accounting & Tax Services
Begosh Tax Service CPA
At-Home CPA Tax Service
CPA Financial & Tax Service
My Tax CPA
...
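To see why a short class-based expression tends to hold up better than an absolute path copied from the browser, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The inline markup is hypothetical and only imitates the class names used above, and extractTitles is an illustrative helper name. One likely cause of the empty result in the question is that the DOM the browser shows can differ from the raw HTML the server returns, so a path that counts every ancestor by position may not match.

```php
<?php
// Hypothetical markup that imitates the class names used in the answer above;
// the real page's HTML may differ.
$sampleHtml = <<<HTML
<html><body>
  <div class="business-container-inner">
    <div class="srp-business-name"><a href="/a" title="Tax Experience CPA Inc">Tax Experience CPA Inc</a></div>
  </div>
  <div class="business-container-inner">
    <div class="srp-business-name"><a href="/b" title="My Tax CPA">My Tax CPA</a></div>
  </div>
</body></html>
HTML;

/**
 * Return the title attribute of every element matched by $expression.
 * (Illustrative helper, not part of any library.)
 */
function extractTitles(string $html, string $expression): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query($expression) as $anchor) {
        $titles[] = $anchor->getAttribute('title');
    }
    return $titles;
}

// The class-based expression matches regardless of how deeply the divs are nested.
$titles = extractTitles($sampleHtml, "//div[@class='srp-business-name']/a");
echo implode(PHP_EOL, $titles), PHP_EOL;

// A position-based absolute path matches nothing once the structure differs.
$brittle = extractTitles($sampleHtml, "/html/body/div[5]/div[1]/a[1]");
echo count($brittle), PHP_EOL;
```

The same idea carries over to Simple HTML DOM: anchor the selector on a stable class or id rather than on the position of every ancestor.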
Here's a modified version that grabs the title and the phone number from each listing div. Notice that find("...", $index) returns the single element at position $index (counting from 0), and returns an array of elements if $index is not set.
$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";
//Create a DOM object
$html = new simple_html_dom();
// Load HTML from a URL or file
$html->load_file($feedUrl);
// Find all elements
$divs = $html->find('div.business-container-inner');
// loop through all elements and display the useful parts
foreach($divs as $div) {
    $title = $div->find('div.srp-business-name a', 0)->title;
    $phone = $div->find('span.business-phone', 0)->plaintext;
    echo $title ." - ". $phone . "<br>";
}
// Clear DOM object
$html->clear();
unset($html);
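The per-listing loop above assumes every listing contains both a name link and a phone span. If one is missing, the indexed lookup has nothing to return, and reading a property from it would presumably fail. Below is a rough sketch of the same pattern with a guard, written with PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM. The markup, the phone number, and the extractListings helper are all hypothetical.

```php
<?php
// Hypothetical markup; the second listing deliberately has no phone number.
$sampleHtml = <<<HTML
<html><body>
  <div class="business-container-inner">
    <div class="srp-business-name"><a href="/a" title="Tax Experience CPA Inc">Tax Experience CPA Inc</a></div>
    <span class="business-phone"> (301) 555-0100 </span>
  </div>
  <div class="business-container-inner">
    <div class="srp-business-name"><a href="/b" title="My Tax CPA">My Tax CPA</a></div>
  </div>
</body></html>
HTML;

/**
 * Return one ['title' => ..., 'phone' => ...] entry per listing container.
 * Missing parts are reported as null. (Illustrative helper, not a library API.)
 */
function extractListings(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $listings = [];
    foreach ($xpath->query("//div[@class='business-container-inner']") as $container) {
        // The leading "." keeps each query relative to the current container.
        // item(0) returns null when the node list is empty.
        $anchor = $xpath->query(".//div[@class='srp-business-name']/a", $container)->item(0);
        $phone  = $xpath->query(".//span[@class='business-phone']", $container)->item(0);

        $listings[] = [
            'title' => $anchor !== null ? $anchor->getAttribute('title') : null,
            'phone' => $phone !== null ? trim($phone->textContent) : null,
        ];
    }
    return $listings;
}

$listings = extractListings($sampleHtml);
foreach ($listings as $listing) {
    // Fall back to a placeholder when a field is missing.
    echo ($listing['title'] ?? '(no title)') . ' - ' . ($listing['phone'] ?? '(no phone)') . PHP_EOL;
}
```

With Simple HTML DOM, the equivalent guard would be to store the result of find('...', 0) in a variable and check it before reading ->title or ->plaintext.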
Upvotes: 1
Reputation: 36
I think you should try this, using a short class-based selector instead of the absolute path:
include('simplehtmldom_1_5/simple_html_dom.php');
// get DOM from URL or file
$feedUrl = "http://www.yellowpages.com/md/cpa-tax?menu_search=false&page=1&refinements%5Bfacet_clicked%5D=HeadingText&refinements%5Bheadingtext%5D%5B%5D=Accountants-Certified+Public&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation&refinements%5Bheadingtext%5D%5B%5D=Tax+Return+Preparation-Business";
$html = new simple_html_dom();
$html->load_file($feedUrl);
$xpath = ".srp-business-name a";
foreach($html->find($xpath) as $e)
    echo $e->title . '<br>';
Upvotes: 0