Reputation: 2950

Unusual behaviour of regex

My Setup:

index.php:

<?php
$page = file_get_contents('a.html');
$arr = array();
preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);
print_r($arr);
?>

a.html:

...other content
<td class="myclass"> 
    THE 
  CONTENT 
</td>
other content...

Output:

Array
(
    [0] => Array
        (
        )
)

If I change line 4 of index.php to:

preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);

The output is:

Array
(
    [0] => <td class="myclass">
     THE 
   CONTENT
</t
    [1] => 
     THE 
   CONTENT
)

I can't make out what's wrong. Please help me match the content between `<td class="myclass">` and `</td>`.

Upvotes: 1

Answers (3)

Anthony Hatzopoulos

Reputation: 10537

Do you understand that the 3rd parameter of preg_match is the matches array? Its first element will contain the full match, and the remaining elements will hold the text captured by each parenthesized subpattern.

https://www.php.net/manual/en/function.preg-match.php

If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

This code preg_match('/<td class=\"myclass\">(.*)\<\/t/s',$page,$arr);

When applied on

...other content
<td class="myclass"> 
    THE 
  CONTENT 
</td>
other content...

Will return the full match in $arr[0] and the result of (.*) in $arr[1]. This result is correct: your content is in [1].

Array
(
    [0] => <td class="myclass">
    THE
  CONTENT
</t
    [1] => 
    THE
  CONTENT
)

Example two

<?php
header('Content-Type: text/plain');
$page = 'A B C D E F';
$arr = array();
preg_match('/C (D) E/', $page, $arr);
print_r($arr);

Example output

Array
(
    [0] => C D E  // this is the full string that matched the pattern
    [1] => D      // this is the part captured by the parentheses, extracted out of [0]
)
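
To tie this back to the original question, here is a minimal sketch that checks the return value of preg_match before reading $arr[1]. The inline $page string is a hypothetical stand-in for a.html, and the # delimiter and the trim() call are my own choices, not something from the question.

```php
<?php
// Hypothetical inline sample standing in for a.html from the question.
$page = "...other content\n<td class=\"myclass\"> \n    THE \n  CONTENT \n</td>\nother content...";

$arr = array();
// preg_match() returns 1 on a match, 0 on no match, and false on error.
$result = preg_match('#<td class="myclass">(.*)</td>#s', $page, $arr);

if ($result === 1) {
    // $arr[0] is the whole match; $arr[1] is what (.*) captured.
    $content = trim($arr[1]);
    echo $content, "\n";
} elseif ($result === 0) {
    echo "Pattern did not match\n";
} else {
    // preg_last_error() returns a PREG_* error constant.
    echo "Regex error code: ", preg_last_error(), "\n";
}
```

Checking the return value makes it easy to tell "no match" apart from "matched, but the capture is not what I expected".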

Upvotes: 1

LSerni

Reputation: 57408

Your code appears to work. I edited the regex to use a different delimiter (# instead of /) to get a clearer view, since the slashes no longer need escaping. You may want to use the ungreedy modifier (U) in case there is more than one myclass TD in your HTML.

I have not been able to reproduce the "array of array" behaviour you note, unless I manipulate the code to add an error -- see at bottom.

<?php
        $page = <<<PAGE
        ...other content
        <td class="myclass">
            THE
          CONTENT
        </td>
        other content...
PAGE;

        preg_match('#<td class="myclass">(.*)</td>#s',$page,$arr);
        print_r($arr);
?>

returns, as expected:

Array
(
    [0] => <td class="myclass">
            THE
          CONTENT
        </td>
    [1] =>
            THE
          CONTENT

)
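
As for the ungreedy modifier mentioned at the top, here is a small sketch of the difference. The HTML string is a made-up sample with two myclass cells, not your actual page.

```php
<?php
// Hypothetical sample with two matching cells, not taken from the question.
$page = '<td class="myclass">FIRST</td><td>x</td><td class="myclass">SECOND</td>';

// Greedy: (.*) runs on to the LAST </td> in the input.
preg_match('#<td class="myclass">(.*)</td>#s', $page, $greedy);

// Ungreedy via the U modifier: (.*) stops at the FIRST </td>.
preg_match('#<td class="myclass">(.*)</td>#sU', $page, $lazy);

// Alternative: make only this one quantifier lazy with a trailing ?.
preg_match('#<td class="myclass">(.*?)</td>#s', $page, $lazyAlt);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
echo $lazyAlt[1], "\n";
```

The greedy version captures everything from FIRST up to and including SECOND, while both lazy variants capture only FIRST.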

The code below is similar to yours but has been modified to cause an identical error. Doesn't seem likely you did this, though. The regexp is modified in order to not match, and the resulting empty array is stored into $arr[0] instead of $arr.

preg_match('#<td class="myclass">(.*)</ td>#s',$page,$arr[0]);

Returns the same error you observe:

Array
(
    [0] => Array
        (
        )

)

I can duplicate the same behaviour you observe (works with </t, does not work with </td>) if I use your regexp, but modify the HTML to have </t d>. I still need to write to $arr[0] instead of $arr if I also want to get an identical output.
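
If you suspect something like that in your real file, one way to check is to dump the bytes around the closing tag. This is only a sketch: the $page string below is a hypothetical input containing the stray space from my experiment, and the 8-byte window is an arbitrary choice.

```php
<?php
// Hypothetical input: a closing tag with a stray space, as in the "</t d>" experiment.
$page = "<td class=\"myclass\">\n  THE CONTENT\n</t d>";

// Find where the closing tag starts and look at the raw bytes that follow.
$pos = strpos($page, '</t');
if ($pos !== false) {
    $tail = substr($page, $pos, 8);
    // A clean "</td>" would print 3c2f74643e; any extra bytes show up in between.
    echo bin2hex($tail), "\n";
}
```

Here the 20 between 74 (t) and 64 (d) is the space that stops </td> from matching.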

Upvotes: 2

pogo

Reputation: 1550

Your regex seems correct. Isn't the syntax of preg_match as follows?

preg_match('/<td class=\"myclass\">(.*)\<\/td>/s',$page,$arr);

The | in the regex represents or

Upvotes: 0