Reputation: 473
We've got an ancient (internal) website with static info. We're going to replace it with something better, so I need to fetch all the info from it. I used to do this via regex, but lately I stumbled upon a few articles stating that using regex to parse info from HTML is inviting Cthulhu to this realm.
So I decided to learn a few new tricks, start over and do it the DOM way. The HTML part I need looks like this:
<table id="articles">
<tr>
<th>
<a href='articles/aa123.html'><img src="/iamges/aa123.jpg" alt="some article"></a>
<br />short description
</th>
<td>
<table class='details'>
<tr><th><a href='articles/aa123.html'>Some Article</a></th></tr>
<tr><th>Type:</th><td>article type</td></tr>
<tr><th>Price:</th><td>€ 99</td></tr>
<tr><th>Manufacturer:</th><td>Some Company</td></tr>
<tr><th>Warehouse:</th><td>x</td></tr>
</table>
</td>
</tr>
</table>
And so far I got this:
$dom = new DOMDocument();
@$dom->loadHTMLFile ($file);
$xpath = new DOMXPath($dom);
$query = "/html/body/table[@id='articles']//th"; //catch all TH's
$data = $xpath->evaluate($query);
And that's about where I get stuck. I know all the content of the returned THs is in their childNodes, but I'm having a hard time getting the values. I need the URL to the details page and the value for the Price column.
How do I get those extracted?
Currently I came up with the following:
$query = '//table[@class="details"]//td';
$data= $xpath->evaluate($query);
$c = $data->length;
for ($i = 0; $i < $c; $i++) {
echo htmlentities($data->item($i)->nodeValue);
}
But this only displays the text values from the TDs. When the content is a link, it only shows the link title, not the URL.
UPDATE Thanks to Fab's suggestion I managed to make some progress. Currently I've got the following:
$tables = $xpath->query('//table[@class="details"]');
foreach($tables as $table) {
$url = $xpath->evaluate('//th/a/@href', $table);
$articleName= $xpath->evaluate('//th/a', $table);
$Manufacturer= $xpath->evaluate('//th[text()="Manufacturer:"]/../td', $table);
echo 'articleName:' . $articleName . ' <br />';
echo 'Manufacturer:' . $Manufacturer. ' <br />';
echo 'url:' . $url. ' <br />';
echo '<br />';
}
But for some reason it always displays the data from the first article (repeated for as many articles as there are on the page), as if the 'foreach' statement always returns the first table found. Any tips?
Upvotes: 3
Views: 209
Reputation: 24576
XPath for the URLs would be:
//table[@class="details"]//th/a/@href
And for the price columns:
//table[@class="details"]//th[text()="Price:"]/../td
You will probably want to get the URL and price for each table separately. For this you could first collect a DOMNodeList with all "details" tables and then search within each one (using the context parameter):
$tables = $xpath->query('//table[@class="details"]');
foreach($tables as $table) {
$url = $xpath->evaluate('//th/a/@href', $table);
$price = $xpath->evaluate('//th[text()="Price:"]/../td', $table);
echo "$url - $price <br>";
}
UPDATE
I forgot one thing: the context parameter only takes effect with relative paths, and //th/... is absolute, so it searches the whole document no matter which context node you pass. You have to add a dot at the beginning: .//th/...
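To make the difference concrete, here is a minimal sketch that runs the same query with and without the leading dot against the second "details" table. The two-article markup and the second URL (articles/bb456.html) are made up for illustration; only the structure follows the question.

```php
<?php
// Hypothetical two-article page, trimmed down from the markup in the question.
$html = <<<HTML
<html><body><table id="articles">
<tr><td><table class="details">
<tr><th><a href="articles/aa123.html">Some Article</a></th></tr>
</table></td></tr>
<tr><td><table class="details">
<tr><th><a href="articles/bb456.html">Other Article</a></th></tr>
</table></td></tr>
</table></body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Use the second "details" table as the context node.
$tables = $xpath->query('//table[@class="details"]');
$second = $tables->item(1);

// Absolute path: evaluation starts at the document root, the context node is ignored.
$absolute = $xpath->query('//th/a/@href', $second);

// Relative path: evaluation starts at the context node.
$relative = $xpath->query('.//th/a/@href', $second);

echo $absolute->length, "\n";              // 2
echo $absolute->item(0)->nodeValue, "\n";  // articles/aa123.html
echo $relative->length, "\n";              // 1
echo $relative->item(0)->nodeValue, "\n";  // articles/bb456.html
```

The absolute query finds both links and its first item is always the first article's, which matches the "always the first article" symptom described in the question.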
Have a look: working demo
I also had to exchange evaluate for query and explicitly access the value of the first item:
$xpath->query(...)->item(0)->nodeValue;
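Putting the pieces together, the whole extraction could look roughly like the sketch below. It is not the code from the demo: the sample page, the second article and the firstValue() helper are assumptions made for illustration, and the price is written as &euro; in the sample so the result does not depend on how the file's encoding is detected.

```php
<?php
// Hypothetical sample page modeled on the markup in the question.
$html = <<<HTML
<html><body><table id="articles">
<tr><th>short description</th><td><table class="details">
<tr><th><a href="articles/aa123.html">Some Article</a></th></tr>
<tr><th>Price:</th><td>&euro; 99</td></tr>
<tr><th>Manufacturer:</th><td>Some Company</td></tr>
</table></td></tr>
<tr><th>short description</th><td><table class="details">
<tr><th><a href="articles/bb456.html">Other Article</a></th></tr>
<tr><th>Price:</th><td>&euro; 12</td></tr>
<tr><th>Manufacturer:</th><td>Other Company</td></tr>
</table></td></tr>
</table></body></html>
HTML;

// Hypothetical helper: trimmed value of the first matching node, or null if nothing matches.
function firstValue(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$articles = [];
foreach ($xpath->query('//table[@class="details"]') as $table) {
    // Every query starts with a dot so it stays inside the current table.
    $articles[] = [
        'url'          => firstValue($xpath, './/th/a/@href', $table),
        'name'         => firstValue($xpath, './/th/a', $table),
        'price'        => firstValue($xpath, './/th[text()="Price:"]/../td', $table),
        'manufacturer' => firstValue($xpath, './/th[text()="Manufacturer:"]/../td', $table),
    ];
}

foreach ($articles as $article) {
    echo $article['name'], ' | ', $article['url'], ' | ',
         $article['price'], ' | ', $article['manufacturer'], "\n";
}
```

Collecting the rows into an array first keeps the scraping separate from whatever the replacement site needs as input. With real files you would go back to loadHTMLFile($file) as in the question.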
Upvotes: 1