Reputation: 2468
I have a corrupt HTML page which I unfortunately can't parse with xml/xcode, so I turned to regex. I'm a regex beginner and I can't get the right result.
Source
<td>FIELD:</td> <td>VALUE<td>
I want to get the value, and this is where I'm stuck:
$regex = '{<td[^>]*<td>(.*?)</td>}';
Edit: as a result I want an array from which I can read the value; the value is the only part I'm interested in.
I'm thankful for any hint.
Cheers, endo
Upvotes: 0
Views: 205
Reputation: 30760
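There are some immediately visible problems with your regex; for example, <td[^>]*<td> doesn't do what you think it does. [^>]* cannot cross the > that closes the first tag, so the pattern never gets as far as the second <td>. But rather than suggest a different regex, let me urge you to do the sanest thing: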
Trust me. Don't do it. Others will come in here and suggest new regex patterns, and their patterns will all be wrong. Regex isn't even up to the task of parsing clean HTML/XML, so trying to use it on arbitrarily corrupted code is a recipe for madness. Try HTML Tidy, which is made for this sort of thing. Depending on what's wrong with the HTML, a parser like HtmlPurifier or Beautiful Soup might also be able to work with it.
It may seem like a little more effort, but you'll save yourself time in the long run.
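As a rough illustration of the parser route, here is a minimal sketch. It does not use HTML Tidy itself; it swaps in PHP's bundled DOMDocument, whose libxml-based HTML loader tolerates some broken markup. The surrounding table and tr wrapper and the XPath expression are assumptions made for the example, and how well this copes depends on what exactly is wrong with the real page.

```php
<?php
// Assumption: the cells sit inside a table row; this wrapper is hypothetical.
$html = '<table><tr><td>FIELD:</td> <td>VALUE<td></tr></table>';

// Collect parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find the cell labelled "FIELD:" and take the next cell in the same row.
$xpath = new DOMXPath($doc);
$cells = $xpath->query('//td[normalize-space(.)="FIELD:"]/following-sibling::td[1]');

$values = [];
foreach ($cells as $cell) {
    $values[] = trim($cell->textContent);
}

print_r($values);
```

The lookup keys on the label text rather than on tag positions, so it should keep working if the cells pick up attributes or extra whitespace.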
Upvotes: 0
Reputation: 13510
Try this:
'{<td>.*?</td>\s+<td>(.*?)</td>}'
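Note, though, that the last tag in your HTML sample is missing its slash (<td> instead of </td>), so this pattern will not match the sample as posted.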
If, by corrupted, you mean missing slashes at closing tags, you can use this:
'{<td>.*?</?td>\s+<td>(.*?)</?td>}'
Here the slash in each closing tag is optional.
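To get the array asked for in the question, the optional-slash pattern can be passed to preg_match_all. This is only a sketch: the first row is the markup from the question, while the second row (OTHER: / 42) is made up to show that several values are collected.

```php
<?php
// First row is the question's markup; the second row is hypothetical.
$html = "<td>FIELD:</td> <td>VALUE<td>\n<td>OTHER:</td> <td>42<td>";

// Pattern from above: "/?" makes the slash optional in both closing tags.
$regex = '{<td>.*?</?td>\s+<td>(.*?)</?td>}';

// $matches[0] holds the full matches, $matches[1] the captured values.
preg_match_all($regex, $html, $matches);
$values = $matches[1];

print_r($values);
```

With this input, $values holds VALUE and 42. The pattern still assumes exactly this layout (plain <td> tags without attributes, whitespace between the two cells), so anything else on the page will be skipped silently.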
Upvotes: 1