janoliver

Reputation: 7824

PHP Regexp: Subpattern that might occur more than once

I'm trying to write a regular expression for HTML code that looks like this:

<tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
</tr>
<tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
    <td>7181</td>
</tr>

Now I want an expression that looks for every table row and can handle dynamic numbers of ([0-9]{4}). So if there are two cells, I'd like to get an array with the two values, if there are three, there should be all three values inside my array.

My regexp HAS TO start and end with:

!<tr> ..... </tr>!sU

Is that possible?

Upvotes: 0

Views: 443

Answers (3)

Ferdinand Beyer

Reputation: 67197

Now I want an expression that looks for every table row and can handle dynamic numbers of ([0-9]{4}). So if there are two cells, I'd like to get an array with the two values, if there are three, there should be all three values inside my array. (...) Is that possible?

No, it's not. You cannot write a pattern with a dynamic number of sub-patterns.
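To illustrate the limitation, here is a minimal sketch (the sample row and the pattern are made up for this demo, they are not from the question). A capturing group that sits inside a repetition is still a single group, and PCRE keeps only the text of its last iteration:

```php
<?php
// Hypothetical single-line row, used only for this demonstration.
$row = '<tr><td>1234</td><td>1231</td><td>7181</td></tr>';

// (\d{4}) is matched three times, but it remains capture group 1.
preg_match('!<tr>(?:<td>(\d{4})</td>)+</tr>!', $row, $m);

// $m[0] is the whole row; $m[1] holds only the last repetition.
print_r($m);
```

The earlier values (1234 and 1231) are matched but not retained, which is why a second pass over each row is needed.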

My regexp HAS TO start and end with:
!<tr> ..... </tr>!sU

Why is that?

If you really want to use regular expressions instead of an XML parser or something more forgiving like Tidy, I suggest a two-step approach.

First step: Find <tr> rows:

!<tr>(.*?)</tr>!s

Second step: Iterate over the results and look for <td>s:

!<td>(?:<[^>]+>)*(\d{4})(?:<[^>]+>)*</td>!

This finds sequences of four decimal digits (0-9) within a <td> and also matches cells where the number is wrapped in nested formatting tags, such as

<td><strong>1234</strong></td>
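Put together, the two steps could look roughly like this. It is a sketch, not a definitive implementation: the variable names ($html, $rows, $trMatches, $tdMatches) are illustrative, and the row pattern carries the s modifier so that . can span the line breaks inside each row of the question's HTML.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
</tr>
<tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
    <td>7181</td>
</tr>
HTML;

$rows = [];

// Step 1: capture the content of every <tr>...</tr> pair.
preg_match_all('!<tr>(.*?)</tr>!s', $html, $trMatches);

foreach ($trMatches[1] as $rowHtml) {
    // Step 2: collect every four-digit cell inside this row.
    preg_match_all(
        '!<td>(?:<[^>]+>)*(\d{4})(?:<[^>]+>)*</td>!',
        $rowHtml,
        $tdMatches
    );
    $rows[] = $tdMatches[1];
}

// One array per row, each with as many values as the row has numeric cells.
print_r($rows);
```

Each entry of $rows is the list of numbers for one table row, so rows with two and three numeric cells produce arrays of two and three values.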

Upvotes: 1

Jonathan Fingland

Reputation: 57177

Regular expressions are notoriously bad at evaluating hierarchical structures, and XML in particular. You are much better off using SimpleXML, or DOMDocument with DOMXPath.

See http://www.php.net/manual/en/simplexmlelement.xpath.php for how to use XPath with SimpleXML, and http://www.php.net/manual/en/domxpath.evaluate.php for how it can be done with DOMXPath.

Note that if your case is as simple as given in the question, then SimpleXML is the better choice. There are some cases where DOMDocument would be more appropriate, so more information would help with that decision.

For example:

<?php
$string = <<<XML
<table>
  <tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
  </tr>
  <tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
    <td>7181</td>
  </tr>
</table>
XML;

$xml = new SimpleXMLElement($string);

/* Select every <td> whose text content is numeric */
$result = $xml->xpath('//tr/td[text() = number(text())]');

// each() was removed in PHP 8, so iterate with foreach instead
foreach ($result as $node) {
    echo $node, "\n";
}

?>
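For comparison, a DOMDocument/DOMXPath variant might look like the sketch below. It assumes PHP's DOM extension is available. The variable names and the four-digit check are illustrative choices, not part of the original answer. Unlike the flat list above, it groups the values per row, which is what the question asks for.

```php
<?php
// Sample input copied from the question, wrapped in a <table>.
$html = <<<HTML
<table>
  <tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
  </tr>
  <tr>
    <td>I'm some text</td>
    <td>1234</td>
    <td>1231</td>
    <td>7181</td>
  </tr>
</table>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html); // HTML parser, more forgiving than strict XML
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//tr') as $tr) {
    $values = [];
    // Relative query: only the <td> children of the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $text = trim($td->textContent);
        // Keep cells that consist of exactly four digits.
        if (preg_match('/^\d{4}$/', $text)) {
            $values[] = $text;
        }
    }
    $rows[] = $values;
}

print_r($rows);
```

Because textContent includes the text of nested elements, a cell such as <td><strong>1234</strong></td> would also be picked up.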

Upvotes: 1

user187291

Reputation: 53950

This should help you get started. Note that the pattern captures only the first run of digits in each row:

$html = ...as above
preg_match_all('~<tr>.+?(\d+).+?</tr>~si', $html, $matches);
print_r($matches);

Upvotes: 2
