Reputation: 31
I'm trying to get some info from the following source:
<random htmlcode here>
<td style="BORDER-RIGHT-STYLE:none;">
<a id="dgWachtlijstFGI_ctl03_hlVolnaam" title="Klant wijzigen" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','? Wijzig=true&lcSchermTitel=&zoekPK=+++140+12++8',false,true); ">FIRST LINE A</a>
(SECOND LINE A)<br>
THIRD LINE A </td>
<random htmlcode here>
<td style="BORDER-RIGHT-STYLE:none;">
<a id="dgWachtlijstFGI_ctl04_hlVolnaam" title="Klant wijzigen" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true&lcSchermTitel=&zoekPK=+++140+12++8',false,true); ">FIRST LINE B</a>
(SECOND LINE B)<br>
THIRD LINE B </td>
<random htmlcode here>
What I have come up with so far is the following (thanks to rubular.com):
<?php $bestand = 'input.htm';
$fd = fopen($bestand,"r");
$message = fread($fd, filesize ($bestand));
$regexp = "FrmKlant.aspx.*\">(.*)<\/a>\s(.*)<br>\s(.*)\s\s(.*)";
if (preg_match_all("#$regexp#siU", $message, $matches))
{
print_r($matches);
}
?>
This seems to put the first and second lines I need into a multidimensional array. So far so good, because a multidimensional array is what I want. However, it doesn't capture the third line, and it somehow creates an empty array[4] as well:
[1] => Array ( [0] => FIRST LINE A [1] => FIRST LINE B )
[2] => Array ( [0] => (SECOND LINE A) [1] => (SECOND LINE B) )
[3] => Array ( [0] => [1] => )
[4] => Array ( [0] => [1] => )
What I'm looking for is this:
[0] => Array ( [0] => FIRST LINE A [1] => FIRST LINE B )
[1] => Array ( [0] => (SECOND LINE A) [1] => (SECOND LINE B) )
[2] => Array ( [0] => THIRD LINE A [1] => THIRD LINE B ) )
Upvotes: 3
Views: 170
Reputation: 4767
Use PHP's DOM parser
Incomplete example, but something to get you started:
$dom = new DOMDocument();
$dom->loadHTML($yourHtmlDocument);
$xPath = new DOMXPath($dom);
$elements = $xPath->query('//random/td/a'); // XPath uses forward slashes; replace with whatever your real path would be
foreach($elements as $node) {
echo $node->nodeValue;
}
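To make that more concrete, here is a minimal sketch of how the three lines from the question could be collected with DOMDocument and DOMXPath. It assumes that every link of interest has class="wl" and sits directly inside a <td>, and that the two remaining lines are the text nodes following the link. The sample markup is a simplified copy of the question's HTML (the href is shortened to "#"), and the names $html, $result and $texts are made up for illustration.

```php
<?php
// Simplified sample modelled on the question; the real page has more markup around it.
$html = <<<'HTML'
<table><tr>
<td style="BORDER-RIGHT-STYLE:none;">
<a id="dgWachtlijstFGI_ctl03_hlVolnaam" class="wl" href="#">FIRST LINE A</a>
(SECOND LINE A)<br>
THIRD LINE A </td>
<td style="BORDER-RIGHT-STYLE:none;">
<a id="dgWachtlijstFGI_ctl04_hlVolnaam" class="wl" href="#">FIRST LINE B</a>
(SECOND LINE B)<br>
THIRD LINE B </td>
</tr></table>
HTML;

// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

$result = [[], [], []];

// Assumption: the links we want carry class="wl" and are direct children of a <td>.
foreach ($xPath->query('//td/a[@class="wl"]') as $link) {
    $texts = [];
    // Gather the non-empty text nodes that follow the link inside the same cell.
    foreach ($xPath->query('following-sibling::text()', $link) as $textNode) {
        $value = trim($textNode->nodeValue);
        if ($value !== '') {
            $texts[] = $value;
        }
    }
    $result[0][] = trim($link->nodeValue);
    $result[1][] = $texts[0] ?? '';
    $result[2][] = $texts[1] ?? '';
}

print_r($result);
```

With the sample above, $result ends up in the shape the question asks for: one sub-array per line, one entry per table cell. If your real cells contain extra text or nested elements, the sibling query will need adjusting.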
Upvotes: 5
Reputation: 25563
It is usually not a good idea to try to extract information from HTML/XML using regular expressions. They are not well suited to dealing with nested structures. Anything you try will break horribly if your "random html" parts are evil enough, so use regular expressions only if you have very good control over the HTML.
Try a parser instead. (Google found me http://simplehtmldom.sourceforge.net/; I have not tried it, though.)
Upvotes: 0
Reputation: 30595
$regexp = "FrmKlant.aspx.*\">(.*)<\/a>\s(.*)<br>\s(.*)\s\s(.*)</td>";
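Why the closing </td> matters: with the U modifier every .* matches as little as possible, so a (.*) with nothing after it is satisfied by an empty string. That explains the empty [3] and [4] in the question, and the fourth pair of parentheses is what creates array[4] in the first place. Below is a variant sketch rather than the exact pattern above. It assumes each cell ends in </td>, drops the U modifier in favour of explicit lazy .*? groups, uses \s* to absorb surrounding whitespace, and keeps only three groups. The sample input is shortened and the name $rows is made up for illustration.

```php
<?php
// Shortened sample input modelled on the question's HTML.
$message = <<<'HTML'
<td style="BORDER-RIGHT-STYLE:none;">
<a class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true',false,true); ">FIRST LINE A</a>
(SECOND LINE A)<br>
THIRD LINE A </td>
<td style="BORDER-RIGHT-STYLE:none;">
<a class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true',false,true); ">FIRST LINE B</a>
(SECOND LINE B)<br>
THIRD LINE B </td>
HTML;

// Each (.*?) is lazy, but the literal that follows it (</a>, <br>, </td>)
// forces it to grow until that literal is reached.
// \s* is greedy and swallows the whitespace around each value.
$regexp = 'FrmKlant\.aspx.*?">(.*?)</a>\s*(.*?)<br>\s*(.*?)\s*</td>';

$rows = [];
if (preg_match_all("#$regexp#si", $message, $matches)) {
    // $matches[0] holds the full matches; drop it so only the three groups remain,
    // re-indexed from 0.
    $rows = array_slice($matches, 1);
}

print_r($rows);
```

Because array_slice() re-indexes numeric keys, $rows starts at [0] as in the desired output. This still depends on the markup staying exactly in this shape, so treat it as a quick fix rather than a robust parser.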
Upvotes: 0