endo.anaconda

Reputation: 2468

RegEx in HTML split with preg-match

I have a corrupt HTML page which I unfortunately can't parse with xml/xcode, so I fell back on regex. I'm a regex beginner and I can't get the right result.

Source

<td>FIELD:</td> <td>VALUE<td>

I want to get the value, and this is where I'm stuck:

$regex = '{<td[^>]*<td>(.*?)</td>}';

Edit: as a result I want an array from which I can read the value; the value is all I'm interested in.

I'm thankful for every hint.

cheers endo

Upvotes: 0

Views: 205

Answers (2)

Justin Morgan

Reputation: 30760

There are some immediately visible problems with your regex; for example, <td[^>]*<td> doesn't do what you think it does. But rather than suggest a different regex, let me urge you to do the sanest thing:

Don't use regex for this!

Trust me. Don't do it. Others will come in here and suggest new regex patterns, and their patterns will all be wrong. Regex isn't even up to the task of parsing clean HTML/XML, so trying to use it on arbitrarily corrupted code is a recipe for madness. Try HTML Tidy, which is made for this sort of thing. Depending on what's wrong with the HTML, a parser like HTML Purifier or Beautiful Soup might also be able to work with it.

It may seem like a little more effort, but you'll save yourself time in the long run.
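
As a rough illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension. Note that this is a swapped-in tool, not one of the libraries named above, and it assumes the extension is enabled. The sample string is hypothetical: it is the fragment from the question wrapped in a table row, and the XPath expression assumes the label cell contains exactly "FIELD:".

```php
<?php
// Hypothetical sample: the fragment from the question, wrapped in a table row.
$html = '<table><tr><td>FIELD:</td> <td>VALUE<td></tr></table>';

// Collect libxml's complaints about the broken markup instead of emitting warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the first <td> that follows the cell whose text is "FIELD:".
$cells = $xpath->query('//td[normalize-space(.) = "FIELD:"]/following-sibling::td[1]');

// Gather the text of every matching cell into a plain array.
$values = [];
foreach ($cells as $cell) {
    $values[] = trim($cell->textContent);
}

print_r($values);
```

Whether this copes with your page depends on how badly the markup is damaged, so treat it as a starting point rather than a guaranteed fix.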

Upvotes: 0

Israel Unterman

Reputation: 13510

Try this:

'{<td>.*?</td>\s+<td>(.*?)</td>}'

But note that your HTML sample is missing a / in the last closing tag. If, by corrupted, you mean missing slashes in closing tags, you can use this:

'{<td>.*?</?td>\s+<td>(.*?)</?td>}'

Here the slashes in the closing tags are optional.
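
To get the values into an array, something along these lines should work. It is only a sketch: the input string is made up (two rows with the same missing slash as in your sample), and the variable names are placeholders.

```php
<?php
// Hypothetical input: two rows with the same damage as the question's sample.
$html = "<td>FIELD1:</td> <td>VALUE1<td>\n<td>FIELD2:</td> <td>VALUE2<td>";

// The pattern with optional slashes in the closing tags.
$regex = '{<td>.*?</?td>\s+<td>(.*?)</?td>}';

// preg_match_all() returns the number of full matches and fills $matches;
// index 1 holds the text captured by the first group for every match.
$count = preg_match_all($regex, $html, $matches);
$values = $matches[1];

print_r($values);
```

preg_match() with the same pattern would give you only the first value, in $matches[1] as a string.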

Upvotes: 1
