Remove all HTML from java server pages

Question

Does anyone know a way to remove all HTML from a JavaServer Page, keeping only the Java code and all JSP properties?

I searched but didn't find any way to do this. The only approach I know will work is to write a parser for JSP and then analyze the AST to keep the nodes that matter, but that solution is painful.

If anyone knows an easy way to do this, please let me know. Otherwise, if you know that a parser is the only possible way, I'd appreciate hearing that too.

EDIT:

I need this to count the number of lines that contain Java code or JSP properties in every JSP.

Ira Baxter · Accepted Answer

You can't do this easily because HTML and JSP are both rich structures, both in terms of atoms (lexemes) and more complex constructs (tables, statements, ...). A full-up parser that recognizes all those structures would do the trick. If you can get such a parser, then that's an easy way to go.

But if you only want physical line counts of HTML vs JSP, then you only need the part of the parser necessary for this task. In particular, you don't need all the construct recognition machinery; just the part that recognizes the atoms, e.g., just the lexical part of the parsing engine.

You can do this by defining lexers for each type of syntax (e.g., HTML and JSP) that pass control to one another as transitions between them are encountered. This is a very standard task modulo sweat equity. Then line counting is pretty straightforward; each recognized lexeme records its starting and ending line, and that gives the raw data necessary.
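To make the mode-switching idea concrete, here is a deliberately crude sketch in Java. It is not a real pair of lexers: it assumes the only transitions that matter are `<%` (enter JSP) and `%>` (return to HTML), and the class and method names (`JspLineCounter`, `countJspLines`) are made up for illustration. It will miscount if `%>` appears inside a Java string literal, it treats `<%-- ... --%>` comments as JSP, and it ignores EL expressions (`${...}`) and `<jsp:...>` action tags entirely, which is exactly the kind of detail a proper lexer would have to handle.

```java
import java.util.BitSet;

final class JspLineCounter {

    /**
     * Counts physical lines that contain at least one non-whitespace
     * character inside a <% ... %> region (delimiters included).
     * Assumption: "<%" and "%>" are the only mode transitions.
     */
    static int countJspLines(String source) {
        BitSet jspLines = new BitSet(); // bit k set => line k holds JSP text
        boolean inJsp = false;          // current lexer mode: HTML or JSP
        int line = 0;                   // zero-based physical line number
        int i = 0;
        int n = source.length();

        while (i < n) {
            char c = source.charAt(i);
            if (c == '\n') {
                line++;
                i++;
            } else if (!inJsp && source.startsWith("<%", i)) {
                inJsp = true;           // HTML mode hands control to JSP mode
                jspLines.set(line);
                i += 2;
            } else if (inJsp && source.startsWith("%>", i)) {
                inJsp = false;          // JSP mode hands control back to HTML
                jspLines.set(line);
                i += 2;
            } else {
                if (inJsp && !Character.isWhitespace(c)) {
                    jspLines.set(line); // real JSP content on this line
                }
                i++;
            }
        }
        return jspLines.cardinality();
    }
}
```

Calling `JspLineCounter.countJspLines(text)` on the contents of one file gives the JSP line count for that file; blank lines inside a scriptlet are not counted. Recording lines in a bit set rather than incrementing a counter keeps a line with several `<% %>` fragments from being counted more than once.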

Building the lexers for HTML and JSP isn't technically hard, but it can be a lot of work ("painful" is how you put it). HTML in particular has gotten pretty complex over the years, and JSP now presumably includes most of Java 7 as a subset.

If you can get such a parser, you should in fact be able to extract just the lexer part for the physical line count. But it is probably easier just to use the parser unchanged.

If you ever decide you want to measure more complex properties of the JSP pages (e.g., nesting depth of HTML constructs, logical statement counts, code coupling), you won't have a choice; you'll really need the parser, because these measures are based on the complex structure of the language constructs and not just the lexemes.

There are likely open source JSP parsers available. Certainly web servers that execute JSP must contain such parsers; check out the guts of Tomcat. You'll have to extract the parser from the web server, and that is likely to be some work. I know there are commercial JSP parsers intended to support exactly this kind of task (my company has one).

If you just want the counts, and you don't want the work, you can get a tool that already has this metrics collection built in. See my company's Source Code Search Engine (SCSE) product, which produces SLOC and McCabe (cyclomatic complexity) measures on files as a byproduct of its code-indexing step. The SCSE uses the JSP parser we have to achieve this effect, out of the box.
