dvarney

Reputation: 25

Extract value from a multiline pattern using PHP and preg_match

I'm trying to extract a value from a multiline pattern with PHP and preg_match. This is the pattern I'm searching for within the string I'm passing to preg_match($regex, $string, $the_match):

Latitude:</td>
        <td class="formCell">
        40-45-40.205 N
       </tr>

I know that if it were all on one line like so:

Latitude:</td><td class="formCell">40-45-40.205 N</tr>

Then the following would be valid and it would properly extract the value:

/Latitude:<\/td><td class="formCell">(.*?)<\/tr>/

However, since the pattern I'm looking for spans multiple lines, the above regex doesn't work. I'm getting the initial string I'm passing to preg_match() via file_get_contents($url), so I'm at the mercy of the remote content to some extent. Any help would be much appreciated!

Upvotes: 1

Views: 1939

Answers (3)

Mitya

Reputation: 34576

Use [\s\S] instead of the dot (.).

/Latitude:<\/td>[\s]*<td class="formCell">([\s\S]*?)<\/tr>/

. is a wildcard, but by default it does not match line-break characters (it does match spaces and tabs). [\s\S] simply says "match all space and non-space characters" (i.e. anything at all, newlines included).

Note I also allowed for optional space characters after </td>.

(Sidenote: the HTML is invalid - closing a table row before closing the table cell.)
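As a minimal sketch of how this pattern could be used with preg_match, assuming the remote page contains exactly the fragment shown in the question: the $html sample string is made up for illustration, and the captured group still carries the surrounding whitespace and newlines, so it is trimmed afterwards.

```php
<?php
// Sample input mirroring the fragment in the question (an assumption;
// the real page fetched with file_get_contents($url) may differ).
$html = "Latitude:</td>\n        <td class=\"formCell\">\n        40-45-40.205 N\n       </tr>";

// [\s\S]*? lazily matches any character, newlines included, with no extra flag.
$regex = '/Latitude:<\/td>[\s]*<td class="formCell">([\s\S]*?)<\/tr>/';

$latitude = null;
if (preg_match($regex, $html, $the_match) === 1) {
    // Group 1 includes the newlines and indentation around the value.
    $latitude = trim($the_match[1]);
}

echo $latitude, PHP_EOL;
```

Checking the return value of preg_match before reading $the_match avoids an undefined-index notice when the remote markup changes and the pattern no longer matches.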

Upvotes: 6

Chris Trahey

Reputation: 18290

I think the trick is to "sprinkle" [\s]* anywhere the HTML format would legally allow whitespace. You do not need special flags or anything.

Latitude:[\s]*<\/td>[\s]*<td[\s]*class="formCell">[\s]*([\s\S]*?)[\s]*<\/tr>

Keep in mind that html is VERY forgiving about whitespace. You need to evaluate your input and decide what is acceptable tolerance for you.

Another caveat is that these elements may have different attributes, or different quote styles... If you must work with that as well, you will need to use more of . and then use the "ungreedy" flag (add an uppercase U after the closing delimiter of the pattern when passing it to the preg functions; lowercase u is the unrelated UTF-8 flag); and then perhaps some fancy back-referencing once you realize that > can legally occur inside of an attribute ;-)
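A rough sketch of the whitespace-tolerant pattern in use, wrapped in / delimiters so it can be passed to preg_match. The helper name extractLatitude and both sample strings are hypothetical, chosen only to show that the same pattern copes with the multiline and the single-line layout from the question. Because [\s]* sits on both sides of the capture group, the captured value comes back without surrounding whitespace.

```php
<?php
// Hypothetical helper: returns the captured latitude, or null if nothing matches.
function extractLatitude(string $html): ?string
{
    // Pattern from the answer above, with / delimiters added.
    $regex = '/Latitude:[\s]*<\/td>[\s]*<td[\s]*class="formCell">[\s]*([\s\S]*?)[\s]*<\/tr>/';

    if (preg_match($regex, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Two assumed layouts of the same fragment: multiline and single-line.
$multiLine  = "Latitude:</td>\n        <td class=\"formCell\">\n        40-45-40.205 N\n       </tr>";
$singleLine = 'Latitude:</td><td class="formCell">40-45-40.205 N</tr>';

$results = array_map('extractLatitude', [$multiLine, $singleLine]);

print_r($results);
```

Note that <td[\s]*class would also accept "<tdclass"; if that matters for your input, [\s]+ is the stricter choice at that position.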

Upvotes: 0

Jelmer

Reputation: 2693

There is no simple flag for this. A simple hack could be:

Latitude:(.*?)<\/td>(.*?)<td class="formCell">(.*?)<\/tr>

Then add the dotall flag (s) to your regex so that the dot (.) also matches newlines. But then it could match a lot more than you intend. Is it your own code, or are you ripping HTML from a third-party website? Because maybe you are using a regex when you don't have to!
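To illustrate the effect of the s modifier, here is a small sketch that runs the pattern above with and without it. The $html sample is an assumption based on the fragment in the question. Since this pattern has three capture groups, the latitude ends up in group 3, and it still needs trimming.

```php
<?php
// Sample input mirroring the fragment in the question (assumed).
$html = "Latitude:</td>\n        <td class=\"formCell\">\n        40-45-40.205 N\n       </tr>";

$pattern = 'Latitude:(.*?)<\/td>(.*?)<td class="formCell">(.*?)<\/tr>';

// Without the s modifier, . stops at newlines, so nothing matches here.
$withoutFlag = preg_match('/' . $pattern . '/', $html);

// With the s (dotall) modifier, . also matches newlines.
$latitude = null;
if (preg_match('/' . $pattern . '/s', $html, $m) === 1) {
    // Group 3 holds the value plus surrounding whitespace.
    $latitude = trim($m[3]);
}

var_dump($withoutFlag);
echo $latitude, PHP_EOL;
```

The lazy quantifiers (.*?) keep each group as short as possible, which limits, but does not remove, the risk of matching across unrelated table cells on a larger page.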

Upvotes: 0
