Garoal

Reputation: 2374

iPhone parse xhtml + css

I have a long, complex XHTML file that contains CSS. Searching on Google and on this site, I've found some libraries that can be useful for XHTML parsing:

However, I'm wondering whether there is any library for iPhone that can convert an XHTML + CSS document to an NSAttributedString (only the text, of course).

I have been thinking about this problem and have some ideas, but I don't think they would be very efficient. My main idea consists of these steps:

I know this is complex, and I don't need you to provide the code (although it would be great if you did). I only want a link to a library or, if none exists, some advice on writing a parser myself.

Of course, if you need more information, ask in the comments.

Thank you!

Upvotes: 0

Views: 890

Answers (2)

xingzhi.sg

Reputation: 433

My way of parsing an HTML string into an NSAttributedString is to recursively append each parsed node (and its child nodes) to an NSMutableAttributedString.

I am not ready to publish my full code anywhere yet. But hopefully this can give you some hints...

NSString+HTML.h

/*  - toHTMLElements
 *  Parses the string into a dictionary of HTML elements with the following keys:
 *  : @"attributedString"   // HTML main body
 *  : @"insets"         // images and/or videos, with range info
 *  : @"as"             // hrefs (links), with range info
 */

- (NSMutableDictionary*) toHTMLElements;

NSString+HTML.m

- (NSMutableDictionary*) toHTMLElements {

    // …
    // handle escape encoding here
    // assume that NSString* htmlString is the processed string;
    // …


    NSMutableDictionary * htmlElements = [[NSMutableDictionary dictionary] retain];

    NSMutableAttributedString * attributedString = [[[NSMutableAttributedString alloc] init] autorelease];
    NSMutableArray * insets = [NSMutableArray array];
    NSMutableArray * as     = [NSMutableArray array];

    [htmlElements setObject:attributedString forKey:HTML_ATTRIBUTEDSTRING];
    [htmlElements setObject:insets forKey:HTML_INSETS];
    [htmlElements setObject:as forKey:HTML_AS];


    // Parse the HTML with an XML parser.
    // CXXML is a variant of TBXML (http://www.tbxml.co.uk/ ) that can handle inline tags such as <span>.
    // Its code is not publicly available yet, so you will need to write your own inline-tag-aware HTML/XML parser.

    CXXML * xml = [CXXML tbxmlWithXMLString:htmlString];
    TBXMLElement * root = xml.rootXMLElement;

    TBXMLElement * next = root->firstChild;

    while (next != nil) {
        //
        // do something here for special treatments if needed
        //
        NSString * tagName = [CXXML elementName:next];

        [self appendXMLElement:next withAttributes:[HTMLElementAttributes defaultAttributesFor:tagName] toHTMLElements:htmlElements];

        next = next->nextSibling;
    }

    return [htmlElements autorelease];
}

- (void) appendXMLElement:(TBXMLElement*)aElement withAttributes:(NSDictionary*)parentAttributes toHTMLElements:(NSMutableDictionary*) htmlElements {

    // Parse aElement and its attribute values here.
    // Assume NSString * tagAttrString holds the CSS declarations for this tag (from its "style" attribute or a CSS file), e.g. width:200px; color:#123456;
    // Let an external HTMLElementAttributes class merge those declarations into the parent node's attributes.

    NSDictionary * tagAttr = [HTMLElementAttributes updateAttributes: parentAttributes withCSSAttributes:tagAttrString];

    // create your NSAttributedString styled by tagAttr
    // create insets such as images / videos or hyper links objects
    // then update the htmlElements for storage

    // once this tag is handled, recursively visit and process the current tag's children

    TBXMLElement * nextChild = aElement->firstChild;

    while (nextChild != nil) {
        [self appendXMLElement:nextChild withAttributes:tagAttr toHTMLElements:htmlElements];
        nextChild = nextChild->nextSibling;
    }
}
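
The HTMLElementAttributes class used above is not shown, so the following is only a minimal sketch of what its merge step might look like, not the author's implementation. The function names ParseCSSDeclarations and MergeAttributes are hypothetical. It assumes the declarations arrive as a plain "name: value;" string, and it treats every property as inherited from the parent, which is a simplification (real CSS does not inherit properties such as width). Splitting on ";" will also break values that contain a semicolon.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the answer): splits a declaration string
// such as @"width:200px; color:#123456;" into property -> value pairs.
static NSDictionary *ParseCSSDeclarations(NSString *style) {
    NSMutableDictionary *result = [NSMutableDictionary dictionary];
    NSCharacterSet *ws = [NSCharacterSet whitespaceAndNewlineCharacterSet];

    for (NSString *declaration in [style componentsSeparatedByString:@";"]) {
        // Split on the first colon only, so a value that itself contains
        // a colon (for example a URL) is kept whole.
        NSRange colon = [declaration rangeOfString:@":"];
        if (colon.location == NSNotFound) {
            continue; // empty or malformed declaration
        }
        NSString *name = [[[declaration substringToIndex:colon.location]
                              stringByTrimmingCharactersInSet:ws] lowercaseString];
        NSString *value = [[declaration substringFromIndex:NSMaxRange(colon)]
                              stringByTrimmingCharactersInSet:ws];
        if ([name length] > 0 && [value length] > 0) {
            [result setObject:value forKey:name];
        }
    }
    return result;
}

// Hypothetical helper: starts from the parent's attributes and lets the
// tag's own declarations override them.
static NSDictionary *MergeAttributes(NSDictionary *parent, NSString *style) {
    NSMutableDictionary *merged = [NSMutableDictionary dictionary];
    if (parent != nil) {
        [merged addEntriesFromDictionary:parent];
    }
    [merged addEntriesFromDictionary:ParseCSSDeclarations(style)];
    return merged;
}
```

For example, MergeAttributes(@{@"color": @"#000000"}, @"width:200px; color:#123456;") yields a dictionary in which color is #123456 and width is 200px. Mapping those string values onto actual attributed-string attributes (fonts, colors) is a separate step that this sketch leaves out.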

Upvotes: 1

Daniel Eggert

Reputation: 6715

Whether this does what you want depends on your needs, but DTCoreText has an HTML -> NSAttributedString converter. It is tailored to what DTCoreText itself wants and needs to do, but it might at least point you in the right direction.
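
As a rough sketch of that converter, DTCoreText adds an initWithHTMLData:documentAttributes: initializer to NSAttributedString. The wrapper function AttributedStringFromXHTML below is a hypothetical name. The snippet assumes ARC, that DTCoreText is linked as a framework (the import path may differ in your setup), and that the markup is UTF-8. Check the DTCoreText headers for the version you use before relying on it.

```objective-c
#import <Foundation/Foundation.h>
#import <DTCoreText/DTCoreText.h> // assumption: framework-style import

// Hypothetical wrapper: converts an XHTML string into an attributed string
// using DTCoreText's HTML initializer. Assumes ARC.
static NSAttributedString *AttributedStringFromXHTML(NSString *xhtml) {
    NSData *data = [xhtml dataUsingEncoding:NSUTF8StringEncoding];
    if (data == nil) {
        return nil; // nothing to convert
    }
    // Passing NULL discards the document attributes out-parameter.
    return [[NSAttributedString alloc] initWithHTMLData:data
                                     documentAttributes:NULL];
}
```

The result uses DTCoreText's own attribute conventions, so it is meant to be drawn with DTCoreText views or Core Text rather than assumed to be compatible with every UIKit text control.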

Upvotes: 2
