user234490

Reputation:

Parsing dirty HTML on iPhone

I have been searching for a good solution for a long time, but I can't find anything that fits my needs...

I want to parse an HTML file and display its content in a table. It's almost like writing yet another RSS feed reader. Doing that with valid XML files is simple and straightforward using NSXMLParser, TouchXML, libxml directly, or one of the other XML parsers out there. But these frameworks either only work with XML or don't cope with non-tidy HTML. The site consists of divs containing links that contain images, paragraphs containing links and images, and so on: just a normal website. Using libxml seems way too complicated in that case.

Does anybody have more experience with parsing dirty HTML pages? Which (free) library/framework did you use? I have the feeling that I'm just missing something obvious here. It can't be that difficult to parse HTML files, can it?

I hope you can point me in the right direction!

Upvotes: 2

Views: 2711

Answers (5)

Rengers

Reputation: 15228

I had to do this some time ago. I ended up using HTML Tidy to clean up the HTML before parsing it with TouchXML.

When I did this, the HTML Tidy docs weren't very clear (IMHO), so I had to dig around a bit to find out how it actually worked. I don't have much time now, but I can look up the code I came up with if you want.

The source (and more) of HTML Tidy can be found here: http://tidy.sourceforge.net/

Upvotes: 1

If you need to parse most of the page, using libxml2 as Anurag suggests is a good idea.

If you just want small segments of data from the file, you are better off using regular expressions to pull the data out. There's also a built-in regex library, which you can access through the RegexKitLite wrapper.
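To make the "small segments" idea concrete, here is a minimal sketch in plain C. It uses POSIX `regex.h` rather than the built-in library or RegexKitLite mentioned above, simply because it needs no extra dependencies. The helper name `first_href` and the pattern are assumptions for illustration only, and the pattern handles only double-quoted attribute values.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies the value of the first href="..." attribute
   found in html into out. Returns 1 on a match, 0 otherwise. */
int first_href(const char *html, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[2];
    int found = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    /* Group 1 captures everything between the double quotes. */
    if (regcomp(&re, "href=\"([^\"]*)\"", REG_EXTENDED | REG_ICASE) != 0)
        return 0;

    if (regexec(&re, html, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= out_size)
            len = out_size - 1;   /* truncate rather than overflow */
        memcpy(out, html + m[1].rm_so, len);
        out[len] = '\0';
        found = 1;
    }

    regfree(&re);
    return found;
}
```

Calling `first_href(page, buf, sizeof buf)` on a downloaded page would fill `buf` with the first link target. This kind of matching ignores document structure, so it only suits a few well-known fragments.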

Upvotes: 1

Anurag

Reputation: 141879

Check out the libxml2 library, which is also available on iPhone and comes with a built-in HTML parser. It claims to handle real-world HTML:

this module implements an HTML 4.0 non-verifying parser with API compatible with the XML parser ones. It should be able to parse "real world" HTML, even if severely broken from a specification point of view.

Upvotes: 1

BastiBen

Reputation: 19870

WebKit should handle dirty HTML and allows you to access the DOM tree using the "Page" and "Frame" classes. Those contain functions to find elements by ID and so on.

Upvotes: 1

Nicolás

Reputation: 7523

I have zero experience but... Can't you use WebKit's parser? I guess it should expose some kind of DOM without necessarily having to render the page.

Upvotes: 0
