Reputation: 1133
I have an HTML table with the following structure:
<tr>
<td class='tablesortcolumn'>atest</td>
<td >Kunde</td>
<td ><a href="">[email protected]</a></td>
<td align="right"><a href="module/dns_reseller/user_edit.php?ns=3&uid=6952"><img src="images/iconedit.gif" border="0"/></a> <img src="images/pixel.gif" width="2" height="1" border="0"/> <a href="module/dns_reseller/user.php?delete=true&uid=6952" onclick="return confirm('Möchten Sie den Datensatz wirklich löschen?');"><img src="images/icontrash.gif" border="0"/></a></td>
</tr>
There are hundreds of these tr blocks.
I want to extract atest and [email protected]
I tried the following:
$document = new DOMDocument();
$document->loadHTML($data);
$selector = new DOMXPath($document);
$elements = $selector->query("//*[contains(@class, 'tablesortcolumn')]");
foreach ($elements as $element) {
    $text = $element->nodeValue;
    print($text);
    print('<br>');
}
Extracting atest is no problem, because I can get the element by its tablesortcolumn class. How can I get the email address?
I cannot simply use //table/tr/td/a because other elements on the page are structured the same way, so I need to select the link by its empty href attribute. I already tried //table/tr/td/a[contains(@href, '')], but it returns the same nodes as //table/tr/td/a.
Does anyone have an idea how to solve this?
Upvotes: 2
Views: 56
Reputation: 809
If you are looking for an email field, you could use a regex. Here is an article that could be useful.
EDIT
Following Nisse Engström's suggestion, I am including the relevant part of the article here in case the blog goes down. Thanks for the advice.
// Suppress XML parsing errors (this is needed to parse Wikipedia's XHTML)
libxml_use_internal_errors(true);
// Load the PHP Wikipedia article
$domDoc = new DOMDocument();
$domDoc->load('http://en.wikipedia.org/wiki/PHP');
// Create XPath object and register the XHTML namespace
$xPath = new DOMXPath($domDoc);
$xPath->registerNamespace('html', 'http://www.w3.org/1999/xhtml');
// Register the PHP namespace if you want to call PHP functions
$xPath->registerNamespace('php', 'http://php.net/xpath');
// Register preg_match to be available in XPath queries
//
// You can also pass an array to register multiple functions, or call
// registerPhpFunctions() with no parameters to register all PHP functions
$xPath->registerPhpFunctions('preg_match');
// Find all external links in the article
$regex = '@^http://[^/]+(?<!wikipedia\.org)/@';
$links = $xPath->query("//html:a[ php:functionString('preg_match', '$regex', @href) > 0 ]");
// Print out matched entries
echo "Found " . (int) $links->length . " external links\n\n";
foreach ($links as $linkDom) { /* @var $linkDom DOMElement */
    $link = simplexml_import_dom($linkDom);
    $desc = (string) $link;
    $href = (string) $link['href'];
    echo " - ";
    if ($desc && $desc != $href) {
        echo "$desc: ";
    }
    echo "$href\n";
}
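The same technique can be pointed at the table from the question instead of at Wikipedia. What follows is only a rough sketch: the sample row is trimmed down, user@example.com is a placeholder address, findEmailLinks() is a hypothetical helper name, and the pattern is deliberately loose (something@something without whitespace) rather than a real email validator.

```php
<?php
// Hypothetical sample row; user@example.com is a placeholder address.
$html = '<table><tr>'
      . '<td class="tablesortcolumn">atest</td>'
      . '<td>Kunde</td>'
      . '<td><a href="">user@example.com</a></td>'
      . '<td><a href="module/dns_reseller/user_edit.php">edit</a></td>'
      . '</tr></table>';

// Hypothetical helper: returns the text of every td/a whose text looks like an email.
function findEmailLinks(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('php', 'http://php.net/xpath');
    $xpath->registerPhpFunctions('preg_match');

    // Loose pattern, assumed good enough for this table: no whitespace, one "@".
    $regex = '/^[^@\s]+@[^@\s]+$/';
    $links = $xpath->query(
        "//td/a[php:functionString('preg_match', '$regex', .) > 0]"
    );

    $emails = [];
    foreach ($links as $link) {
        $emails[] = trim($link->nodeValue);
    }
    return $emails;
}

print_r(findEmailLinks($html));
```

The regex must not contain a single quote, because it is embedded in a single-quoted XPath string literal.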
Upvotes: 1
Reputation: 22617
The following XPath expression does exactly what you want:
//*[@class = 'tablesortcolumn' or contains(text(),'@')]
Using the input document you have shown, it will yield (individual results separated by -------------):
<td class="tablesortcolumn">atest</td>
-----------------------
<a href="">[email protected]</a>
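For completeness, here is a minimal sketch of how that expression might be run from PHP with DOMXPath, as in the question. The row is a simplified version of the one shown, user@example.com is a placeholder address, and queryText() is a hypothetical helper, not something from the original code.

```php
<?php
// Simplified sample row; user@example.com is a placeholder address.
$html = '<table><tr>'
      . '<td class="tablesortcolumn">atest</td>'
      . '<td>Kunde</td>'
      . '<td><a href="">user@example.com</a></td>'
      . '</tr></table>';

// Hypothetical helper: runs an XPath expression and returns the trimmed text of each match.
function queryText(string $html, string $expression): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $xpath = new DOMXPath($doc);
    $values = [];
    foreach ($xpath->query($expression) as $node) {
        $values[] = trim($node->nodeValue);
    }
    return $values;
}

$expression = "//*[@class = 'tablesortcolumn' or contains(text(), '@')]";
foreach (queryText($html, $expression) as $value) {
    echo $value, "\n";
}
```

One caveat: in XPath 1.0, contains(text(), '@') only inspects the first text node child of each element, so an element whose "@" sits after a nested tag would not match.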
Upvotes: 1
Reputation: 2962
Can you try running an XPath query that matches text containing the string @? It seems unlikely that this character would be used for anything else.
Something like this might work:
//*[text()[contains(.,'@')]]
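If the name and the address need to stay together, the "@" test can be scoped to each row. The sketch below is one possible way to do that, under some assumptions: the rows are simplified, the addresses are placeholders, and extractUsers() is a hypothetical helper. It also uses @href = '' to test for an empty href; contains(@href, '') cannot do that, because every string contains the empty string, which is why it matched every link in the question.

```php
<?php
// Simplified sample data; the addresses are placeholders.
$html = '<table>'
      . '<tr><td class="tablesortcolumn">atest</td><td>Kunde</td>'
      . '<td><a href="">user@example.com</a></td></tr>'
      . '<tr><td class="tablesortcolumn">btest</td><td>Kunde</td>'
      . '<td><a href="">other@example.org</a></td></tr>'
      . '</table>'
      . '<table><tr><td><a href="">not an address</a></td></tr></table>';

// Hypothetical helper: maps each tablesortcolumn name to the address in the same row.
function extractUsers(string $html): array
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $users = [];
    // Only visit rows that actually contain a tablesortcolumn cell.
    foreach ($xpath->query("//tr[td[@class = 'tablesortcolumn']]") as $row) {
        // Relative queries: the second argument makes $row the context node.
        $name  = $xpath->query("td[@class = 'tablesortcolumn']", $row)->item(0);
        $email = $xpath->query("td/a[@href = ''][text()[contains(., '@')]]", $row)->item(0);
        if ($name !== null && $email !== null) {
            $users[trim($name->nodeValue)] = trim($email->nodeValue);
        }
    }
    return $users;
}

print_r(extractUsers($html));
```

The link in the second table is skipped twice over: its row has no tablesortcolumn cell, and its text contains no "@".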
Upvotes: 2
Reputation: 237
If you are using Chrome, you can test your XPath queries in the DevTools console, like this:
$x("//*[contains(@class, 'tablesortcolumn')]")
Upvotes: 0