Reputation:
How can I remove comments, and the contents of those comments, from an HTML file using Java, where the comments are written like:
<!--
Any ideas or help would be appreciated.
Upvotes: 3
Views: 1414
Reputation: 7195
Take a look at JTidy, the Java port of HTML Tidy. You could override the print methods of the PPrint object so that comment tags are skipped.
Upvotes: 5
Reputation: 3505
If you don't have valid XHTML (a posted comment reminded me of this), first run JTidy over the HTML to tidy it up and turn it into valid XHTML.
See this for example code on JTidy.
Then I'd parse the XHTML into a DOM instance, like so:
final DocumentBuilderFactory newFactory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = newFactory.newDocumentBuilder();
Document document = documentBuilder.parse( new InputSource( new StringReader( string ) ) );
Then I'd navigate through the document tree and modify nodes as needed.
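The tree walk described above could look roughly like the following. This is a minimal sketch, not the answerer's own code: the class name `DomCommentRemover` and its method names are made up for illustration, and it assumes the input is already well-formed XHTML with no DOCTYPE that the parser would need to resolve. It uses only the JDK's built-in XML APIs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here only for illustration.
final class DomCommentRemover {

    // Recursively removes every comment node found among the descendants of 'node'.
    static void removeComments(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling before the current child is possibly detached.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else {
                removeComments(child);
            }
            child = next;
        }
    }

    // Parses well-formed XHTML, drops all comment nodes, and serializes the result.
    static String stripComments(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xhtml)));

        removeComments(document);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Force XML output; a root element named "html" would otherwise select HTML output.
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String input = "<html><!-- top --><body><p>Hello<!-- inline --></p></body></html>";
        System.out.println(stripComments(input));
    }
}
```

If the comments never need to be inspected, calling `factory.setIgnoringComments(true)` before `newDocumentBuilder()` should make the parser discard them at parse time, which would make the explicit walk unnecessary.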
Upvotes: 4
Reputation: 70775
Try a simple regex like:
String commentless = pageString.replaceAll("<!--[\\w\\W]*?-->", "");
Edit: to explain the regex:
<!-- matches the literal comment start
[\w\W] matches every character (even newlines) which will be inside the comment
*? matches multiple of the 'any character' but matches the smallest amount possible (not greedy)
--> closes the comment
Upvotes: 0
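For reference, the regex approach from the answer above can be tried in a small self-contained program. This is only a sketch: the class name `RegexCommentStripper` is hypothetical, and it swaps `[\w\W]` for `.` combined with `Pattern.DOTALL`, which should match the same text, including newlines.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
final class RegexCommentStripper {

    // DOTALL lets '.' match line terminators, so multi-line comments are covered.
    // '*?' is reluctant, so each match stops at the first closing "-->".
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p><!-- drop\nthis --><p>also keep</p>";
        System.out.println(stripComments(page)); // prints <p>keep</p><p>also keep</p>
    }
}
```

A regex like this treats the page as plain text, so it would also remove anything that merely looks like a comment, for example inside a script block or a conditional comment. Whether that matters depends on the pages being processed.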