BrianHobbs

Reputation: 127

Recommendations for a Java HTML parser/editor

I've been running into problem after problem trying to use a third-party HTML editor to do what (I hoped) was a simple operation. Because of these problems, I'm looking for recommendations for an alternative HTML parser I could use to perform the operation.

Here's my situation: I have span tags in my HTML (each with an ID attribute to identify it), and I simply want to replace their contents when something is updated in another area of my client. For example:

<html>
    <body>
        <p>Hello <span id="1">name</span> you are <span id="2">age</span></p>
    </body>
</html>

I've been trying to use the HTMLDocument class in javax.swing.text.html like this:

Element e;
// look up each span by its id attribute, then try to replace its contents
e = document.getElement(document.getDefaultRootElement(), HTML.Attribute.ID, "1");
document.setInnerHTML(e, "John");
e = document.getElement(document.getDefaultRootElement(), HTML.Attribute.ID, "2");
document.setInnerHTML(e, "99");

but the element returned is a leaf element, and setInnerHTML won't work on it. Unfortunately, the document, reader and parser are all supplied by a third party, so I can't really modify them.

So, has anyone else had a similar problem, and could you recommend an alternative library for this?

Thanks in advance, B.

Upvotes: 2

Views: 2178

Answers (5)

stwissel

Reputation: 20384

I used JTidy very successfully. It takes in HTML and cleans out the crap, so you end up with a proper DOM object, and then you simply use XPath to alter your targets.
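To illustrate the XPath step, here is a minimal sketch that assumes the markup is already well-formed (for example, after a tidy pass, which is not shown). It uses only the DOM and XPath classes that ship with the JDK rather than JTidy itself; the class and method names are made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JTidy or any other library.
class XPathSpanUpdater {

    // Replaces the text of <span id="..."> in well-formed markup and returns the result.
    // Assumption: the id contains no quote characters, since it is pasted into the XPath.
    static String update(String xhtml, String id, String text) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Find the first span carrying the requested id.
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node span = (Node) xpath.evaluate(
                    "//span[@id='" + id + "']", dom, XPathConstants.NODE);
            if (span != null) {
                span.setTextContent(text);
            }

            // Serialize the DOM back to a string, without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not update span " + id, ex);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <span id=\"1\">name</span> you are "
                + "<span id=\"2\">age</span></p></body></html>";
        String updated = update(update(html, "1", "John"), "2", "99");
        System.out.println(updated);
    }
}
```

With real-world HTML that is not well-formed, the parse call would throw, which is where a cleanup step such as JTidy would come in first.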

Upvotes: 0

HerdplattenToni

Reputation: 494

Can you really not accomplish that with javax.swing.text.html.HTMLDocument?

I have never tried this, but from reading through the API, something along the lines of

document.replace(e.getStartOffset(), e.getEndOffset()-e.getStartOffset(), "John", null)

instead of using setInnerHTML() could work.
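A fuller sketch of that idea, using only JDK classes, might look like the following. The class and method names are invented for the example, and parse() merely stands in for the third-party document from the question. Copying the old element's attributes into the replace() call (instead of passing null) is my own assumption about how to keep the span's id attached to the new text; I have not checked it against the third-party parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.Element;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class for illustration only.
class SpanTextReplacer {

    // Parses an HTML string into an HTMLDocument (a stand-in for the supplied document).
    static HTMLDocument parse(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument document = (HTMLDocument) kit.createDefaultDocument();
        try {
            kit.read(new StringReader(html), document, 0);
        } catch (IOException | BadLocationException ex) {
            throw new IllegalStateException("Could not parse HTML", ex);
        }
        return document;
    }

    // Replaces the text covered by the element with the given id.
    // Returns false if no element with that id is found.
    static boolean replaceText(HTMLDocument document, String id, String text) {
        Element e = document.getElement(
                document.getDefaultRootElement(), HTML.Attribute.ID, id);
        if (e == null) {
            return false;
        }
        int start = e.getStartOffset();
        int length = e.getEndOffset() - start;
        // Copy the attributes before the old text is removed, so they can be reapplied.
        AttributeSet attrs = new SimpleAttributeSet(e.getAttributes());
        try {
            document.replace(start, length, text, attrs);
        } catch (BadLocationException ex) {
            throw new IllegalStateException("Bad offsets for element " + id, ex);
        }
        return true;
    }

    // Returns the document's text content without markup.
    static String plainText(HTMLDocument document) {
        try {
            return document.getText(0, document.getLength());
        } catch (BadLocationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        HTMLDocument document = parse("<html><body><p>Hello <span id=\"1\">name</span>"
                + " you are <span id=\"2\">age</span></p></body></html>");
        replaceText(document, "1", "John");
        replaceText(document, "2", "99");
        System.out.println(plainText(document));
    }
}
```

This works on the document's text offsets rather than on its HTML structure, so it only suits plain-text replacements like the ones in the question.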

Upvotes: 2

Steven Huwig

Reputation: 20794

I'm having good luck on my current project with TagSoup.

Upvotes: 0

Prabhu R

Reputation: 14242

HTMLParser is a great library but is LGPL, which might not be suitable for some commercial projects.

If your HTML is well-formed, you can use Dom4J to traverse the nodes; if it is not well-formed, you can use Tidy in conjunction with Dom4J.

Upvotes: 0

kgiannakakis

Reputation: 104198

Have you tried HTML Parser? It is a robust, open source HTML parsing library for Java.

Upvotes: 0
