Reputation: 1202
I am looking for the best way to extract the text content of specific elements from arbitrary pieces of HTML.
I cannot seem to figure out the regex for it.
<td valign="top" style="border: solid 1px black; padding: 4px;">
<h4>Dec 05, 2015 23:16:52</h4>
<h3>rron7pam has won</h3>
</td>
<table width="100%" style="border: 1px solid #DED3B9" id="attack_info_att">
<tbody>
<tr>
<th style="width:20%">Attacker:</th>
<th><a title="..." href="/guest.php?screen=info_player&id=255995">Bliksem</a></th>
</tr>
</tbody>
</table>
The above are only examples, but for these examples, I am interested in
There is a lot more information that I need from separate pieces of HTML, but if I can get one or two right, I should be able to work out the rest.
EDIT based on comments and answers: There could be any arbitrary text in the HTML, depending on how the report was set up (to hide the attacker's units, etc.). I need to look for patterns of specific HTML tags.
In the example above, "the text between the <h4></h4> tags directly followed by a set of <h3></h3> tags inside a <td>" will be the date that I need.
Some examples of links with different formats:
https://enp2.tribalwars.net/public_report/70d3a2a55461e9eb09f543958b608304
https://enp2.tribalwars.net/public_report/5216e0e16c9d3657f981ce7e3cb02580
There are elements that will always be the same, as far as I can tell, e.g., as per the above to get the date.
Upvotes: 1
Views: 86
Reputation: 89565
An example with DOMDocument:
$url = 'https://enp2.tribalwars.net/public_report/70d3a2a55461e9eb09f543958b608304';
// prevent libxml warnings about malformed HTML from being displayed
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
# let's find the interesting nodes:
// td that contains all the needed information (in other words, the nearest common ancestor)
$rootNode = $xp->query('(//table[@class="vis"]/tr/td[./h4])[1]')->item(0);
// first h4 node that contains the date
$dateNode = $xp->query('(./h4)[1]', $rootNode)->item(0);
// following h3 node that contains the player name
$winnerNode = $xp->query('(./following-sibling::h3)[1]', $dateNode)->item(0);
// link inside the attacker table that holds the attacker's name and profile URL
$attackerNode = $xp->query('(./table[@id="attack_info_att"]/tr/th/a)[1]', $rootNode)->item(0);
# extract special values
$winner = preg_replace('~ has won$~', '', $winnerNode->nodeValue);
$attackerID = html_entity_decode($attackerNode->getAttribute('href'));
$attackerID = parse_url($attackerID, PHP_URL_QUERY);
parse_str($attackerID, $queryVars);
$attackerID = $queryVars['id'];
$result = [
    'date'       => $dateNode->nodeValue,
    'winner'     => $winner,
    'attacker'   => $attackerNode->nodeValue,
    'attackerID' => $attackerID,
];
print_r($result);
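The code above assumes every node is found; if a report hides part of its content, `item(0)` returns null and the property access fails. Below is a minimal sketch of the same idea with null checks, working on an HTML string instead of a URL. The function name `parseReport` is hypothetical, and the sample markup is the fragment from the question wrapped in a `<table><tr>` of my own so that the `<td>` is valid; the real page layout may differ.

```php
<?php
// Hypothetical helper (not from the code above): parses an HTML string
// and returns null for any field whose node is missing.
function parseReport(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);

    // First <h4> directly inside a <td>, then the first <h3> sibling after it.
    $dateNode = $xp->query('(//td/h4)[1]')->item(0);
    $winnerNode = $dateNode !== null
        ? $xp->query('following-sibling::h3[1]', $dateNode)->item(0)
        : null;

    // "//th/a" rather than "/tr/th/a" so an optional <tbody> does not matter.
    $attackerNode = $xp->query('(//table[@id="attack_info_att"]//th/a)[1]')->item(0);

    $attackerID = null;
    if ($attackerNode !== null) {
        $query = (string) parse_url($attackerNode->getAttribute('href'), PHP_URL_QUERY);
        parse_str($query, $queryVars);
        $attackerID = $queryVars['id'] ?? null;
    }

    return [
        'date'       => $dateNode !== null ? trim($dateNode->nodeValue) : null,
        'winner'     => $winnerNode !== null
            ? preg_replace('~ has won$~', '', trim($winnerNode->nodeValue))
            : null,
        'attacker'   => $attackerNode !== null ? trim($attackerNode->nodeValue) : null,
        'attackerID' => $attackerID,
    ];
}

// Sample input: the question's fragment, wrapped in <table><tr> (an assumption).
$html = <<<'HTML'
<table><tr><td valign="top" style="border: solid 1px black; padding: 4px;">
<h4>Dec 05, 2015 23:16:52</h4>
<h3>rron7pam has won</h3>
</td></tr></table>
<table width="100%" style="border: 1px solid #DED3B9" id="attack_info_att">
<tbody><tr>
<th style="width:20%">Attacker:</th>
<th><a title="..." href="/guest.php?screen=info_player&id=255995">Bliksem</a></th>
</tr></tbody>
</table>
HTML;

print_r(parseReport($html));
```

A caller can then check each field for null before using it, rather than having the script stop on the first report that lacks one of the tags.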
Upvotes: 3
Reputation: 3093
It wouldn't be pretty, but could you use strpos to return the start and end positions of the tags/content, then use substr to return that portion of the string?
string substr ( string $string , int $start [, int $length ] )
mixed strpos ( string $haystack , mixed $needle [, int $offset = 0 ] )
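A rough sketch of what that could look like, assuming the tags appear exactly as written (no attributes, no nesting). The helper name `textBetween` is made up for this example:

```php
<?php
// Hypothetical helper: returns the text between the first occurrence of
// $open (searching from $offset) and the next $close, or null when
// either marker is missing.
function textBetween(string $html, string $open, string $close, int $offset = 0): ?string
{
    $start = strpos($html, $open, $offset);
    if ($start === false) {
        return null;
    }
    $start += strlen($open); // skip past the opening tag itself
    $end = strpos($html, $close, $start);
    if ($end === false) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

$html = '<td><h4>Dec 05, 2015 23:16:52</h4><h3>rron7pam has won</h3></td>';
$date = textBetween($html, '<h4>', '</h4>');
$winner = textBetween($html, '<h3>', '</h3>');
```

Note that this only matches literal strings, so something like `<h4 class="x">` would not be found by searching for `<h4>`.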
I would say that having to do it like this probably means there is something wrong with how you're receiving the data further up. I really don't think it's going to be efficient to keep scanning the DOM over and over.
Upvotes: 0