Reputation: 4605
I've got a string with very unclean HTML. Before I parse it, I want to convert this:
<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>
into NE DEK 143
so it is a bit easier to parse. I've got this regular expression (RegexKitLite):
NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>"
withString:@"$1 $3 $5"];
I'm not an expert in regex. Can someone help me out here?
Regards, dodo
Upvotes: 0
Views: 274
Reputation: 81384
Amarghosh and bobince (the winning answerer of the linked question) are generally right about this. However, since you are just sanitising, regexps are actually just fine.
First, strip the tags (the g flag makes each substitution apply to every match, not only the first one):
s/<.*?>//g
Then collapse all extra spaces into one:
s/\s+/ /g
Then remove leading/trailing space:
s/^\s+|\s+$//g
Then get the values:
^([^ ]+) ([^ ]+) ([^ ]+)$
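For what it's worth, here is a rough sketch of those four steps chained together. It is in Python rather than RegexKitLite, purely to show the flow; the function name extract_values and the sample string are made up for illustration, and the patterns should carry over to any PCRE-like engine.

```python
import re

# Sample input modelled on the HTML from the question (LF line endings assumed).
sample = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n'
    '</font> </TD>\n'
    '</TR></TABLE>'
)

def extract_values(html):
    """Hypothetical helper: strip tags, normalise whitespace, split into three values."""
    text = re.sub(r"<.*?>", "", html)        # 1. strip the tags (all of them)
    text = re.sub(r"\s+", " ", text)         # 2. collapse whitespace runs into one space
    text = re.sub(r"^\s+|\s+$", "", text)    # 3. trim leading/trailing space
    match = re.match(r"^([^ ]+) ([^ ]+) ([^ ]+)$", text)  # 4. capture the three values
    return list(match.groups()) if match else []

print(extract_values(sample))
```

Note that this only works because none of the tags in the sample spans a line break; a tag split across lines would need a dot-matches-newline option on the first pattern.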
Upvotes: 1
Reputation: 579
If you are sure of your HTML hierarchy, you can just extract the text enclosed by the font tags:
// C# example (needs: using System.Text.RegularExpressions;)
Regex r = new Regex(@"<\s*font((\s+[^<>]*)|(\s*))>(?<desiredText>[^<>]*)<\s*/\s*font\s*>");
string result = "";
foreach (Match m in r.Matches(txt))
    result += m.Groups["desiredText"].Value.Trim() + " "; // separate the values with a space
The result is the text enclosed by the font tags, with the whitespace at its edges trimmed.
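In case a runnable version helps, here is a small sketch of the same idea in Python. The group name desiredText comes from the pattern above; the function name font_texts and the sample string are assumptions for illustration, and Python spells named groups as (?P<name>...) instead of (?<name>...).

```python
import re

# Sample input modelled on the HTML from the question.
sample = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n'
    '</font> </TD>\n'
    '</TR></TABLE>'
)

# Opening font tag (with or without attributes), tag-free text, closing font tag.
FONT_RE = re.compile(
    r"<\s*font(?:\s+[^<>]*|\s*)>(?P<desiredText>[^<>]*)<\s*/\s*font\s*>",
    re.IGNORECASE,
)

def font_texts(html):
    """Hypothetical helper: return the trimmed text of every font element."""
    return [m.group("desiredText").strip() for m in FONT_RE.finditer(html)]

print(" ".join(font_texts(sample)))
```

Because the captured text is restricted to [^<>]*, a font element that contains nested tags would simply not match, so this relies on the markup being as flat as in the example.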
Upvotes: 0
Reputation: 336108
I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot . is used in places where it would have to match newlines, the slash looks like it's escaped unnecessarily, etc.
But in your example, the text you're trying to extract is characterized by not being surrounded by tags: each value sits on a line of its own.
So a search for all occurrences of (?m)^[^<>\r\n]+$
should find all matches. (The + matters; without it the pattern only matches lines that are exactly one character long.)
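As a quick illustration, here is how that search might look in Python, using the pattern with a + quantifier so that each line can be longer than one character. The function name tag_free_lines and the sample string are made up, and the sketch assumes LF line endings (with CRLF input, Python's $ would not match before the \r, so the line endings would need normalising first).

```python
import re

# Sample input modelled on the HTML from the question (LF line endings assumed).
sample = (
    '<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'NE\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    'DEK\n'
    '</font> </TD>\n'
    '<TD width="33%" nowrap=1><font size="1" face="Arial">\n'
    '143\n'
    '</font> </TD>\n'
    '</TR></TABLE>'
)

def tag_free_lines(html):
    """Hypothetical helper: return every whole line that contains no < or >."""
    # (?m) makes ^ and $ match at the start and end of each line.
    return re.findall(r"(?m)^[^<>\r\n]+$", html)

print(" ".join(tag_free_lines(sample)))
```

This leans entirely on the layout of the example: if a value ever shares a line with a tag, that line is skipped.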
Upvotes: 0