Reputation: 2151
I have HTML stored in an NSString. I download it from the internet and parse it using NSXMLParser. It seems, however, that the parser has problems with entities such as &oacute; (ó), &bdquo; („), &rsquo; (’), etc. This is quite a big problem, actually, because it just tells me it failed and stops parsing any further.
I found some good solutions for this in other questions here on Stack Overflow, but they recommended using the NSString+HTML category or Google Toolbox for Mac (the NSString category uses GTM). I have already had projects that used GTM, and it made running my app on the iOS simulator impossible, so I'd like to avoid that.
Upvotes: 0
Views: 366
Reputation: 4433
It’s probably easiest just to write your own method for this, using NSScanner. Note that the entity syntax is a little more complex than just a list of replacements; namely, you will need to support decimal numeric references (&#nnnn;), hexadecimal numeric references (&#xhhhh;) and named entities (&name;), and then you’ll need a table of mappings for the named entities (there’s a list in the HTML 4 specification).
Here’s some code (written in Stack Overflow, untested) to get you started:
    static NSDictionary *entityDict;

    if (!entityDict)
        entityDict = loadEntityMappingTable();

    NSScanner *scanner = [NSScanner scannerWithString:myHTMLString];
    NSMutableString *result = [NSMutableString string];

    [scanner setCharactersToBeSkipped:nil]; // Don’t skip whitespace

    while (![scanner isAtEnd]) {
        NSString *chunk = nil, *name = nil;

        // Copy everything up to the next "&" verbatim
        if ([scanner scanUpToString:@"&" intoString:&chunk])
            [result appendString:chunk];

        // Consume the "&" itself; if there is none, we have reached the end
        if (![scanner scanString:@"&" intoString:NULL])
            break;

        if ([scanner scanString:@"#" intoString:NULL]) {
            unsigned uch;
            NSUInteger scanLoc;
            BOOL hex = NO;

            // This is a numeric reference
            if ([scanner scanString:@"x" intoString:NULL]) {
                hex = YES;
                scanLoc = [scanner scanLocation];
                if (![scanner scanHexInt:&uch]) {
                    // If we fail, show the entire thing in the result string
                    [result appendString:@"&#x"];
                    continue;
                }
            } else {
                int ich;
                scanLoc = [scanner scanLocation];
                if (![scanner scanInt:&ich]) {
                    // If we fail, show the entire thing
                    [result appendString:@"&#"];
                    continue;
                }
                if (ich < 0) {
                    // Bad Unicode code point
                    [result appendString:@"&#"];
                    [scanner setScanLocation:scanLoc];
                    continue;
                }
                uch = (unsigned)ich;
            }

            // You may also care to prohibit control codes (depending on your application)
            // i.e. uch < 0x20 || (uch >= 0x7f && uch < 0xa0)
            if ((uch >= 0xd800 && uch <= 0xdfff) || uch > 0x10ffff) {
                // Bad Unicode code point; show it in the result
                [result appendString:hex ? @"&#x" : @"&#"];
                [scanner setScanLocation:scanLoc];
                continue;
            }

            if (![scanner scanString:@";" intoString:NULL]) {
                // Unterminated; show it in the result
                [result appendString:hex ? @"&#x" : @"&#"];
                [scanner setScanLocation:scanLoc];
                continue;
            }

            if (uch <= 0xffff)
                [result appendFormat:@"%C", (unichar)uch];
            else {
                // Outside the BMP: encode as a UTF-16 surrogate pair
                unichar lo, hi;
                uch -= 0x10000;
                hi = 0xd800 | (uch >> 10);
                lo = 0xdc00 | (uch & 0x3ff);
                [result appendFormat:@"%C%C", hi, lo];
            }
            continue;
        }

        if ([scanner scanUpToString:@";" intoString:&name]) {
            NSString *ch;

            if (![scanner scanString:@";" intoString:NULL]) {
                // Unterminated; show it in the result
                [result appendFormat:@"&%@", name];
                continue;
            }

            // Entity names are case-sensitive (e.g. &Eacute; vs &eacute;)
            ch = [entityDict objectForKey:name];
            if (!ch) {
                // Unrecognised; show it in the result
                [result appendFormat:@"&%@;", name];
                continue;
            }

            [result appendString:ch];
        } else {
            // "&" followed directly by ";" or by the end of input; keep the "&"
            [result appendString:@"&"];
        }
    }
Stick that in a function or method somewhere, implement loadEntityMappingTable() to initialise the dictionary of mappings, and it should work.
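As a rough idea of what loadEntityMappingTable() might look like, here is a minimal sketch. The signature, the ownership convention (the caller keeps the returned dictionary forever) and the handful of entries are all assumptions for illustration; a real table would need the full list from the HTML 4 specification. Keys are entity names without the leading & and trailing ;.

```objc
#import <Foundation/Foundation.h>

// Sketch only: a few of the named entities from the HTML 4 list.
// Returns an owned dictionary, since the caller stores it in a static.
static NSDictionary *loadEntityMappingTable(void) {
    return [[NSDictionary alloc] initWithObjectsAndKeys:
        @"&",      @"amp",
        @"<",      @"lt",
        @">",      @"gt",
        @"\"",     @"quot",
        @"\u00a0", @"nbsp",    // no-break space
        @"\u00f3", @"oacute",  // ó
        @"\u201e", @"bdquo",   // „
        @"\u2019", @"rsquo",   // ’
        nil];
}
```

A longer table could just as well be loaded from a property list in the app bundle, which keeps the mapping out of the source code.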
FWIW, this same general approach, using a loop and an NSScanner, is easy to apply to lots of similar problems that in scripting languages might be dealt with using regular expression matching.
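To illustrate, here is a small sketch of the same loop-and-scanner pattern applied to a different job: collecting every run of decimal digits in a string, which a scripting language might handle with a pattern like \d+. The function name DigitRunsInString is hypothetical, made up for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns each run of ASCII digits in `input`, in order.
static NSArray *DigitRunsInString(NSString *input) {
    NSCharacterSet *digits =
        [NSCharacterSet characterSetWithCharactersInString:@"0123456789"];
    NSScanner *scanner = [NSScanner scannerWithString:input];
    NSMutableArray *runs = [NSMutableArray array];

    [scanner setCharactersToBeSkipped:nil]; // Treat whitespace like any other character

    while (![scanner isAtEnd]) {
        NSString *run = nil;

        // Skip over anything that is not a digit
        [scanner scanUpToCharactersFromSet:digits intoString:NULL];

        // Collect the digit run, if we are now positioned at one
        if ([scanner scanCharactersFromSet:digits intoString:&run])
            [runs addObject:run];
    }

    return runs;
}
```

Each pass through the loop consumes at least one character, so the loop always terminates, just as in the entity-decoding code.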
Upvotes: 1