
Reputation: 2151

remove html entities from NSString - solution working on both iOS simulator and device

I have HTML stored in an NSString. I download it from the internet and parse it using NSXMLParser. However, the parser seems to have problems with entities such as those for ó, „, ’ etc. That is quite a big problem, actually, because it just reports a failure and stops parsing any further.

I found some good solutions for this in other topics here on Stack Overflow, but they recommended using the NSString+HTML category or Google Toolbox for Mac (the NSString category uses GTM). I already had projects that used GTM, and it made running my app on the iOS simulator impossible, so I'd like to avoid that.

Upvotes: 0

Views: 366

Answers (1)


Reputation: 4433

It’s probably easiest just to write your own method for this, using NSScanner. Note that the entity syntax is a little more complex than just a list of replacements; namely, you will need to support:

  • &#D; where D is a decimal number
  • &#xH; where H is a hexadecimal number (upper and lower case are both OK)

and then you’ll need a table of mappings for the named entities (there’s a list in the HTML 4 specification).

Here’s some code (written in Stack Overflow, untested) to get you started:

static NSDictionary *entityDict;

if (!entityDict)
  entityDict = loadEntityMappingTable();

NSScanner *scanner = [NSScanner scannerWithString:myHTMLString];
NSMutableString *result = [NSMutableString string];

[scanner setCharactersToBeSkipped:nil]; // Don’t skip whitespace

while (![scanner isAtEnd]) {
  NSString *chunk, *name;

  // Copy everything up to the next ampersand
  if ([scanner scanUpToString:@"&" intoString:&chunk])
    [result appendString:chunk];

  // Consume the ampersand itself; if there is none, we have reached the end
  if (![scanner scanString:@"&" intoString:NULL])
    break;

  if ([scanner scanString:@"#" intoString:NULL]) {
    unsigned uch;
    NSUInteger scanLoc;
    BOOL hex = NO;

    // This is a numeric reference
    if ([scanner scanString:@"x" intoString:NULL]) {
      hex = YES;
      scanLoc = [scanner scanLocation];
      if (![scanner scanHexInt:&uch]) {
        // If we fail, show the entire thing in the result string
        [result appendString:@"&#x"];
        continue;
      }
    } else {
      int ich;
      scanLoc = [scanner scanLocation];
      if (![scanner scanInt:&ich]) {
        // If we fail, show the entire thing
        [result appendString:@"&#"];
        continue;
      }

      if (ich < 0) {
        // Bad Unicode code point
        [result appendString:@"&#"];
        [scanner setScanLocation:scanLoc];
        continue;
      }

      uch = (unsigned)ich;
    }

    // You may also care to prohibit control codes (depending on your application)
    // i.e. uch < 0x20 || uch >= 0x7f && uch < 0xa0

    if ((uch >= 0xd800 && uch <= 0xdfff) || uch > 0x10ffff) {
      // Bad Unicode code point; show it in the result
      [result appendString:hex ? @"&#x" : @"&#"];
      [scanner setScanLocation:scanLoc];
      continue;
    }

    if (![scanner scanString:@";" intoString:NULL]) {
      // Unterminated; show it in the result
      [result appendString:hex ? @"&#x" : @"&#"];
      [scanner setScanLocation:scanLoc];
      continue;
    }

    if (uch <= 0xffff)
      [result appendFormat:@"%C", (unichar)uch];
    else {
      unichar lo, hi;

      // Outside the BMP: encode as a UTF-16 surrogate pair
      uch -= 0x10000;
      hi = 0xd800 | (uch >> 10);
      lo = 0xdc00 | (uch & 0x3ff);

      [result appendFormat:@"%C%C", hi, lo];
    }

    continue;
  }

  if ([scanner scanUpToString:@";" intoString:&name]) {
    NSString *ch;

    if (![scanner scanString:@";" intoString:NULL]) {
      // Unterminated; show it in the result
      [result appendFormat:@"&%@", name];
      continue;
    }

    // Entity names are case-sensitive (&Eacute; and &eacute; differ)
    ch = [entityDict objectForKey:name];

    if (!ch) {
      // Unrecognised; show it in the result
      [result appendFormat:@"&%@;", name];
      continue;
    }

    [result appendString:ch];
  } else {
    // A lone "&" with no name after it; keep it as-is
    [result appendString:@"&"];
  }
}

Stick that in a function or method somewhere, implement loadEntityMappingTable() to initialise the dictionary of mappings and it should work.
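For illustration, a minimal sketch of what loadEntityMappingTable() might look like. The answer only names this function; the body below is an assumption, and it lists just a handful of entities (the full set is in the HTML 4 specification). Keys are entity names without the leading & and trailing ;, and values are the replacement strings.

```objc
#import <Foundation/Foundation.h>

// Hypothetical body for the loadEntityMappingTable() helper mentioned above.
// Only a few HTML 4 named entities are included; extend the table as needed.
static NSDictionary *loadEntityMappingTable(void)
{
  // alloc/init so the caller owns the result and can keep it in a static
  return [[NSDictionary alloc] initWithObjectsAndKeys:
          @"&",      @"amp",
          @"<",      @"lt",
          @">",      @"gt",
          @"\"",     @"quot",
          @"\u00a0", @"nbsp",
          @"\u00f3", @"oacute",
          @"\u201e", @"bdquo",
          @"\u2019", @"rsquo",
          nil];
}
```

With a table like this, the three entities from the question (oacute, bdquo and rsquo) would resolve to ó, „ and ’ respectively.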

FWIW, this same general approach, using a loop and an NSScanner, is easy to apply to lots of similar problems that in scripting languages might be dealt with using regular expression matching.

Upvotes: 1
