dododedodonl

Reputation: 4605

Regular Expression doesn't match

I've got a string with very unclean HTML. Before I parse it, I want to convert this:

<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>

into NE DEK 143, so it is a bit easier to parse. I've got this regular expression (RegexKitLite):

NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>" 
                                                     withString:@"$1 $3 $5"];

I'm not an expert in regex. Can someone help me out here?

Regards, dodo

Upvotes: 0

Views: 274

Answers (3)

Delan Azabani

Reputation: 81384

Amarghosh and bobince, the winning answerer of the linked question, are generally right about this. However, since you are just sanitising, regexps are actually just fine.

First, strip the tags:

s/<.*?>//g

Then collapse all extra spaces into one:

s/\s+/ /g

Then remove leading/trailing space:

s/^\s+|\s+$//g

Then get the values:

^([^ ]+) ([^ ]+) ([^ ]+)$

Upvotes: 1

chapluck

Reputation: 579

If you are sure of your HTML hierarchy, you can just extract the text enclosed by the font tags:

// C# example
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim();

The result is the text enclosed by the font tags, with leading and trailing whitespace trimmed.
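
For comparison, here is a minimal Python sketch of the same font-tag idea. The names FONT_TEXT, font_texts and sample are invented for the example, and Python spells named groups (?P<name>...) where .NET uses (?<name>...). It joins the pieces with a space, which is an assumption about the desired output.

```python
import re

# Same idea as the C# pattern: capture whatever sits between <font ...> and </font>.
FONT_TEXT = re.compile(
    r"<\s*font(?:\s+[^<>]*|\s*)>(?P<desired_text>[^<>]*)<\s*/\s*font\s*>"
)


def font_texts(raw):
    """Hypothetical helper: return the trimmed text of every font element."""
    return [m.group("desired_text").strip() for m in FONT_TEXT.finditer(raw)]


# Two cells taken from the question's markup.
sample = (
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\nNE\n</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\nDEK\n</font> </TD>'
)

print(" ".join(font_texts(sample)))
```

The negated class [^<>]* also matches newlines, so the line breaks around each value do not get in the way here.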

Upvotes: 0

Tim Pietzcker

Reputation: 336108

I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, the slash looks like it's escaped unnecessarily, etc.

But in your example, the text you're trying to extract is characterized by not being surrounded by tags.

So a search for all occurrences of (?m)^[^<>\r\n]+$ should find all matches.
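
A small Python sketch can illustrate both points; it is not RegexKitLite code, and the names html, dot_match and bare_lines are made up for the example. It assumes \n line endings and uses a + quantifier so that lines of one or more characters match.

```python
import re

# Sample input copied from the question.
html = """<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>"""


def dot_match(raw, dotall):
    """Hypothetical helper: try (.+?) between font tags, with or without (?s)."""
    prefix = "(?s)" if dotall else ""
    return re.search(prefix + r"<font[^>]*>(.+?)</font>", raw)


def bare_lines(raw):
    """Hypothetical helper: return every line that contains no angle brackets."""
    return re.findall(r"(?m)^[^<>\r\n]+$", raw)


print(dot_match(html, dotall=False))   # the dot cannot cross the newline
print(dot_match(html, dotall=True).group(1).strip())
print(bare_lines(html))
```

If the same dot behaviour applies in RegexKitLite, prefixing the original pattern with (?s) might be enough to make it match, though the line-based search is the simpler route.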

Upvotes: 0
