Reputation: 31
I'm looking for a Java-based HTML parser that can search and replace text while preserving the HTML tags. This question has been asked here before, but the answers don't seem to hit the target. I downloaded a few HTML parsers and wrote simple programs to see whether they can do the job, including jsoup, Jericho, and Java HTML Parser. They can all search, but I could not find a way to replace text while preserving the HTML tags.
I have read the complete thread for these posts:
How to find/replace text in html while preserving html tags/structure
html search and replace on server side
If no such parser exists today, what is the best way to implement one? If you have already done something like this, can you share the code?
Upvotes: 2
Views: 1149
Reputation: 120516
The Caja parser uses libhtmlparser, an HTML5 parser that deals well with tag soup containing embedded XML subtrees, producing an org.w3c.dom.DocumentFragment, and has a renderer that produces well-formed HTML.
The parser code is at http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/html/DomParser.java
The renderer code is at http://code.google.com/p/google-caja/source/browse/trunk/src/com/google/caja/parser/html/Nodes.java
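Once you have an org.w3c.dom tree, one way to replace text without touching the tags is to walk the tree and rewrite only the text nodes. Here is a minimal sketch of that idea using only the JDK's DOM classes. The class and method names (TextNodeReplacer, replaceText, replaceInMarkup) are made up for illustration, and the JDK XML parser stands in for the Caja parser here, so the input must be well-formed markup. With Caja you would pass the DocumentFragment it produces to replaceText and serialize with its renderer instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Caja or any other library.
class TextNodeReplacer {

    // Replaces target in every text node under node.
    // Element names, attributes, and comments are left untouched.
    static void replaceText(Node node, String target, String replacement) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replace(target, replacement));
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            replaceText(child, target, replacement);
        }
    }

    // Demo wrapper: parses well-formed markup with the JDK XML parser
    // (an assumption; real-world HTML needs a tolerant HTML parser),
    // applies the replacement, and serializes the tree back to a string.
    static String replaceInMarkup(String wellFormedMarkup, String target, String replacement)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
        replaceText(doc.getDocumentElement(), target, replacement);

        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The class attribute also contains "cat" but is not a text node,
        // so only the visible text is rewritten.
        System.out.println(
                replaceInMarkup("<p class=\"cat\">The <b>cat</b> sat.</p>", "cat", "dog"));
    }
}
```

Note that this only finds matches that sit entirely inside one text node. A phrase split across tags, such as "c&lt;b&gt;a&lt;/b&gt;t", would not be matched, and handling that case needs extra logic to map offsets in the concatenated text back to individual nodes.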
Upvotes: 1
Reputation: 4122
The Jericho parser might help you. It has been around for a long time and works with malformed HTML. http://jericho.htmlparser.net/docs/index.html
Upvotes: 1