Reputation: 31

preg_match_all not capturing all intended results

I'm trying to get some info from the following source:

<random htmlcode here>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl03_hlVolnaam" title="Klant wijzigen" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?  Wijzig=true&amp;lcSchermTitel=&amp;zoekPK=+++140+12++8',false,true); ">FIRST LINE A</a>
      (SECOND LINE A)<br>
      THIRD LINE A        </td>
<random htmlcode here>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl04_hlVolnaam" title="Klant wijzigen" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true&amp;lcSchermTitel=&amp;zoekPK=+++140+12++8',false,true); ">FIRST LINE B</a>
       (SECOND LINE B)<br>
      THIRD LINE B        </td>
<random htmlcode here>

What I have come up with so far is the following (thanks to rubular.com):

<?php
$bestand = 'input.htm';
$fd = fopen($bestand, "r");
$message = fread($fd, filesize($bestand));
$regexp = "FrmKlant.aspx.*\">(.*)<\/a>\s(.*)<br>\s(.*)\s\s(.*)";
if (preg_match_all("#$regexp#siU", $message, $matches))
{
    print_r($matches);
}
?>

This does put the first and second lines I need into a multidimensional array, which is what I want. However, it doesn't capture the third line, and it somehow creates an extra, empty array [4]:

[1] => Array ( [0] => FIRST LINE A [1] => FIRST LINE B ) 
[2] => Array ( [0] =>  (SECOND LINE A) [1] => (SECOND LINE B) ) 
[3] => Array ( [0] => [1] => ) 
[4] => Array ( [0] => [1] => )

What I'm looking for is this:

[0] => Array ( [0] => FIRST LINE A [1] => FIRST LINE B ) 
[1] => Array ( [0] =>  (SECOND LINE A) [1] =>  (SECOND LINE B) ) 
[2] => Array ( [0] => THIRD LINE A [1] => THIRD LINE B ) )

Upvotes: 3

Answers (3)

Ivar Bonsaksen

Reputation: 4767

Use PHP's DOM parser

Incomplete example, but something to get you started:

$dom = new DOMDocument();
$dom->loadHTML($yourHtmlDocument);

$xPath = new DOMXPath($dom);
$elements = $xPath->query('//random/td/a'); // XPath uses forward slashes; substitute whatever your real path would be

foreach($elements as $node) {
  echo $node->nodeValue;
}
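
To flesh that out a little, here is a minimal sketch of how the three lines per cell might be collected with DOMDocument and DOMXPath. The sample markup is a shortened copy of the question's HTML (the `href` values and the surrounding `<table>` are assumptions), and the `//td[a[@class="wl"]]` path is only a guess at how the real page is structured:

```php
<?php
// Shortened sample modeled on the question; the real page layout is an assumption.
$html = <<<'HTML'
<table><tr>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl03_hlVolnaam" title="Klant wijzigen" class="wl" href="#">FIRST LINE A</a>
      (SECOND LINE A)<br>
      THIRD LINE A        </td>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl04_hlVolnaam" title="Klant wijzigen" class="wl" href="#">FIRST LINE B</a>
       (SECOND LINE B)<br>
      THIRD LINE B        </td>
</tr></table>
HTML;

libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xPath = new DOMXPath($dom);
$result = [[], [], []];

// Every <td> that directly contains an <a class="wl">
foreach ($xPath->query('//td[a[@class="wl"]]') as $td) {
    // Non-empty text nodes in document order:
    // link text, text before <br>, text after <br>
    $parts = [];
    foreach ($xPath->query('.//text()', $td) as $text) {
        $value = trim($text->nodeValue);
        if ($value !== '') {
            $parts[] = $value;
        }
    }
    // Store column-wise, matching the layout asked for in the question
    foreach ([0, 1, 2] as $i) {
        $result[$i][] = $parts[$i] ?? '';
    }
}

print_r($result);
```

With this sample input, `$result` should hold three sub-arrays (first, second and third lines), each with one entry per cell. The values are trimmed, so the leading space before "(SECOND LINE B)" is dropped.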

By the way, look at this.

Upvotes: 5

Jens

Reputation: 25563

It is usually not a good idea to try to extract information from HTML/XML using regular expressions. They are not well suited to dealing with nested structures. Anything you try will break horribly if your "random html" parts are evil enough, so use them only if you have very good control over the HTML.

Try a parser instead. (Google found me http://simplehtmldom.sourceforge.net/, I have not tried it, though)

Upvotes: 0

amphetamachine

Reputation: 30595

$regexp = "FrmKlant.aspx.*\">(.*)<\/a>\s(.*)<br>\s(.*)\s\s(.*)</td>";
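
Some background on why an anchor such as `</td>` matters: the `U` modifier makes every quantifier lazy, so a trailing `(.*)` with nothing after it is satisfied by the empty string, and a lazy `(.*)` followed by `\s\s` stops at the first two whitespace characters of the indentation. That would explain the empty [3] and [4] in the question. As an alternative sketch (not the pattern above, and assuming the cells never contain further tags), negated character classes can replace `.*` and the `U` modifier altogether. The sample input below is a shortened copy of the question's HTML:

```php
<?php
// Shortened sample modeled on the question's markup.
$message = <<<'HTML'
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl03_hlVolnaam" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true',false,true); ">FIRST LINE A</a>
      (SECOND LINE A)<br>
      THIRD LINE A        </td>
<td style="BORDER-RIGHT-STYLE:none;">
      <a id="dgWachtlijstFGI_ctl04_hlVolnaam" class="wl" href="javascript: Pop(600,860,'klantwijzig','FrmKlant.aspx','?Wijzig=true',false,true); ">FIRST LINE B</a>
       (SECOND LINE B)<br>
      THIRD LINE B        </td>
HTML;

// [^>]*>    : skip to the end of the opening <a ...> tag
// ([^<]*)   : link text
// ([^<]*?)  : text up to <br>, surrounding whitespace left outside the group
// ([^<]*?)  : text up to </td>, surrounding whitespace left outside the group
$regexp = '#FrmKlant\.aspx[^>]*>([^<]*)</a>\s*([^<]*?)\s*<br>\s*([^<]*?)\s*</td>#i';

$lines = [];
if (preg_match_all($regexp, $message, $matches)) {
    // $matches[0] holds the full matches; drop it so the groups start at index 0
    $lines = array_slice($matches, 1);
}

print_r($lines);
```

`array_slice($matches, 1)` is what shifts the capture groups down to indexes 0 to 2, as requested in the question. This still depends on the markup staying as regular as the sample.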

Upvotes: 0
