kartik

Reputation:

Parsing an HTML file using Java

How can I remove comments, along with their contents, from an HTML file using Java, where the comments are written like:

<!--

Any ideas or help would be appreciated.

Upvotes: 3

Views: 1414

Answers (3)

Kees de Kooter

Reputation: 7195

Take a look at JTidy, the Java port of HTML Tidy. You could override the print methods of the PPrint object so that comment tags are skipped on output.

Upvotes: 5

Daniel Hiller

Reputation: 3505

If you don't have valid XHTML (which a posted comment reminded me of), you should first run the HTML through JTidy to tidy it up and make it valid XHTML.

See this for example code on jtidy.

Then I'd parse the XHTML into a DOM instance.

Like so:

final DocumentBuilderFactory newFactory = DocumentBuilderFactory.newInstance();
final DocumentBuilder documentBuilder = newFactory.newDocumentBuilder();
Document document = documentBuilder.parse( new InputSource( new StringReader( string ) ) );

Then I'd navigate through the document tree and modify nodes as needed.
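As a rough illustration of that last step, here is a minimal sketch that walks the tree, removes every comment node, and serializes the result. It assumes the input is already well-formed XHTML (for example after a JTidy pass); the class and method names (DomCommentRemover, removeComments, stripComments) are made up for this example, and only standard JDK XML APIs are used.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names are illustrative, not from any library.
final class DomCommentRemover {

    /** Recursively removes every comment node found among the descendants of the given node. */
    static void removeComments(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling first: removing 'child' detaches it from the tree.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.COMMENT_NODE) {
                node.removeChild(child);
            } else {
                removeComments(child);
            }
            child = next;
        }
    }

    /** Parses well-formed XHTML, drops all comments, and returns the serialized markup. */
    static String stripComments(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xhtml)));
            removeComments(document);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Input that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Could not process the document", e);
        }
    }
}
```

If the comments never need to be inspected, calling setIgnoringComments(true) on the DocumentBuilderFactory before parsing should make the parser discard them, so no tree walk is needed.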

Upvotes: 4

cobbal

Reputation: 70775

Try a simple regex like:

String commentless = pageString.replaceAll("<!--[\\w\\W]*?-->", "");

edit: to explain the regex:

  • <!-- matches the literal comment start
  • [\w\W] matches any single character (including newlines), i.e. whatever appears inside the comment
  • *? repeats that 'any character' class, but matches as few characters as possible (non-greedy), so the match stops at the first -->
  • --> matches the literal comment end
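The pieces explained above can be put together in a small, self-contained sketch. It uses Pattern.DOTALL so that '.' also matches line terminators, which should behave the same as the [\w\W] class; the class and method names (RegexCommentStripper, strip) are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative.
final class RegexCommentStripper {

    // DOTALL lets '.' match newlines too; '*?' keeps the match non-greedy,
    // so each match ends at the first "-->" after a "<!--".
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Returns the input with every <!-- ... --> block removed. */
    static String strip(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }
}
```

Because the match is non-greedy, text between two separate comments is kept. Note that a regex knows nothing about HTML structure, so it would also remove comment-like text inside script blocks or CDATA sections.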

Upvotes: 0
