Reputation: 21220
There are two styles of comments, C-style and C++-style. How can I recognize them in Java source code?
/* comments */
// comments
I am free to use any method or third-party library.
Upvotes: 1
Views: 6737
Reputation: 170308
To reliably find all comments in a Java source file, I wouldn't use regex, but a real lexer (aka tokenizer).
Two popular choices for Java are JFlex and ANTLR.
Contrary to popular belief, ANTLR can also be used to create only a lexer, without a parser.
Here's a quick ANTLR demo. You need the following files in the same directory:
JavaCommentLexer.g

lexer grammar JavaCommentLexer;

options {
  filter=true;
}

SingleLineComment
  :  FSlash FSlash ~('\r' | '\n')*
  ;

MultiLineComment
  :  FSlash Star .* Star FSlash
  ;

StringLiteral
  :  DQuote
     ( (EscapedDQuote)=> EscapedDQuote
     | (EscapedBSlash)=> EscapedBSlash
     | Octal
     | Unicode
     | ~('\\' | '"' | '\r' | '\n')
     )*
     DQuote {skip();}
  ;

CharLiteral
  :  SQuote
     ( (EscapedSQuote)=> EscapedSQuote
     | (EscapedBSlash)=> EscapedBSlash
     | Octal
     | Unicode
     | ~('\\' | '\'' | '\r' | '\n')
     )
     SQuote {skip();}
  ;

fragment EscapedDQuote
  :  BSlash DQuote
  ;
fragment EscapedSQuote
  :  BSlash SQuote
  ;
fragment EscapedBSlash
  :  BSlash BSlash
  ;
fragment FSlash
  :  '/' | '\\' ('u002f' | 'u002F')
  ;
fragment Star
  :  '*' | '\\' ('u002a' | 'u002A')
  ;
fragment BSlash
  :  '\\' ('u005c' | 'u005C')?
  ;
fragment DQuote
  :  '"'
  |  '\\u0022'
  ;
fragment SQuote
  :  '\''
  |  '\\u0027'
  ;
fragment Unicode
  :  '\\u' Hex Hex Hex Hex
  ;
fragment Octal
  :  '\\' ('0'..'3' Oct Oct | Oct Oct | Oct)
  ;
fragment Hex
  :  '0'..'9' | 'a'..'f' | 'A'..'F'
  ;
fragment Oct
  :  '0'..'7'
  ;

Main.java

import org.antlr.runtime.*;

public class Main {

  public static void main(String[] args) throws Exception {
    // Pass the path of the Java source file to scan.
    JavaCommentLexer lexer = new JavaCommentLexer(new ANTLRFileStream(""));
    CommonTokenStream tokens = new CommonTokenStream(lexer);

    // String and char literals are skipped by the lexer, so only comment tokens remain.
    for(Object o : tokens.getTokens()) {
      CommonToken t = (CommonToken)o;
      if(t.getType() == JavaCommentLexer.SingleLineComment) {
        System.out.println("SingleLineComment :: " + t.getText().replace("\n", "\\n"));
      }
      if(t.getType() == JavaCommentLexer.MultiLineComment) {
        System.out.println("MultiLineComment :: " + t.getText().replace("\n", "\\n"));
      }
    }
  }
}

Test.java

\u002f\u002a <- multi line comment start
multi
line
comment // not a single line comment
\u002A/
public class Test {

  // single line "not a string"

  String s = "\u005C" \242 not // a comment \\\" \u002f \u005C\u005C \u0022;

  /*
  regular multi line comment
  */

  char c = \u0027"'; // the " is not the start of a string

  char q1 = '\u005c''; // == '\''

  char q2 = '\u005c\u0027'; // == '\''

  char q3 = \u0027\u005c\u0027\u0027; // == '\''

  char c4 = '\047';

  String t = "/*";

  \u002f\u002f another single line comment

  String u = "*/";
}

Now, to run the demo, do:
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp antlr-3.2.jar org.antlr.Tool JavaCommentLexer.g
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ javac -cp antlr-3.2.jar *.java
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp .:antlr-3.2.jar Main
and you'll see the following being printed to the console:
MultiLineComment :: \u002f\u002a <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n\u002A/
SingleLineComment :: // single line "not a string"
SingleLineComment :: // a comment \\\" \u002f \u005C\u005C \u0022;
MultiLineComment :: /*\n regular multi line comment\n */
SingleLineComment :: // the " is not the start of a string
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: \u002f\u002f another single line comment
You can create a sort of lexer with regex yourself, of course. The following demo does not handle Unicode literals inside source files, however:

Test2.java

/* <- multi line comment start
multi
line
comment // not a single line comment
*/
public class Test2 {

  // single line "not a string"

  String s = "\" \242 not // a comment \\\" ";

  /*
  regular multi line comment
  */

  char c = '"'; // the " is not the start of a string

  char q1 = '\''; // == '\''

  char c4 = '\047';

  String t = "/*";

  // another single line comment

  String u = "*/";
}

Main2.java

import java.io.*;
import java.util.*;
import java.util.regex.*;

public class Main2 {

  // Reads the whole file into one string, normalizing line breaks to \n.
  private static String read(File file) throws IOException {
    StringBuilder b = new StringBuilder();
    Scanner scan = new Scanner(file);
    while(scan.hasNextLine()) {
      String line = scan.nextLine();
      b.append(line).append('\n');
    }
    scan.close();
    return b.toString();
  }

  public static void main(String[] args) throws Exception {
    // Pass the path of the Java source file to scan.
    String contents = read(new File(""));

    String slComment = "//[^\r\n]*";
    String mlComment = "/\\*[\\s\\S]*?\\*/";
    String strLit = "\"(?:\\\\.|[^\\\\\"\r\n])*\"";
    String chLit = "'(?:\\\\.|[^\\\\'\r\n])+'";
    String any = "[\\s\\S]";

    // Only the two comment alternatives are captured (groups 1 and 2);
    // literals and any other character are matched and discarded.
    Pattern p = Pattern.compile(
        String.format("(%s)|(%s)|%s|%s|%s", slComment, mlComment, strLit, chLit, any)
    );

    Matcher m = p.matcher(contents);

    while(m.find()) {
      String hit = m.group();
      if(m.group(1) != null) {
        System.out.println("SingleLine :: " + hit.replace("\n", "\\n"));
      }
      if(m.group(2) != null) {
        System.out.println("MultiLine :: " + hit.replace("\n", "\\n"));
      }
    }
  }
}

If you run Main2, the following is printed to the console:
MultiLine :: /* <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n*/
SingleLine :: // single line "not a string"
MultiLine :: /*\n regular multi line comment\n */
SingleLine :: // the " is not the start of a string
SingleLine :: // == '\''
SingleLine :: // another single line comment
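
If neither ANTLR nor a large regex is appealing, the same idea (skip string and char literals, collect comments) can also be written as a small hand-rolled scanner. What follows is only a minimal sketch under some assumptions: the names CommentScanner, findComments and skipLiteral are made up for illustration, and it ignores Unicode escapes and Java text blocks, so it is not a full Java lexer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the demos above.
class CommentScanner {

    // Returns every line comment and block comment in src, in source order.
    static List<String> findComments(String src) {
        List<String> comments = new ArrayList<>();
        int n = src.length();
        int i = 0;
        while (i < n) {
            char c = src.charAt(i);
            if (c == '"' || c == '\'') {
                // Comment markers inside literals must not count, so skip the literal.
                i = skipLiteral(src, i);
            } else if (src.startsWith("//", i)) {
                int end = i;
                while (end < n && src.charAt(end) != '\n' && src.charAt(end) != '\r') {
                    end++;
                }
                comments.add(src.substring(i, end));
                i = end;
            } else if (src.startsWith("/*", i)) {
                int close = src.indexOf("*/", i + 2);
                int end = (close < 0) ? n : close + 2; // unterminated comment runs to end of input
                comments.add(src.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return comments;
    }

    // Returns the index just past the string or char literal that starts at src[start].
    private static int skipLiteral(String src, int start) {
        char quote = src.charAt(start);
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                i += 2; // skip the backslash and the escaped character
            } else if (c == quote) {
                return i + 1;
            } else if (c == '\n' || c == '\r') {
                return i; // unterminated literal: give up at the line end
            } else {
                i++;
            }
        }
        return src.length();
    }

    public static void main(String[] args) {
        String src = "String t = \"/*\"; // real comment\n/* multi\nline */ char c = '\"';";
        for (String comment : findComments(src)) {
            System.out.println(comment.replace("\n", "\\n"));
        }
    }
}
```

For the sample string in main, this reports the line comment and the block comment, but not the /* that sits inside the string literal.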
Upvotes: 6
Reputation: 68907
EDIT: I've been searching for a while, but here is the real working regex:
String regex = "((//[^\n\r]*)|(/\\*(.+?)\\*/))"; // New Regex
List<String> comments = new ArrayList<String>();
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(code);
// code is the C-style source in which you want to search
while (m.find())
{
    System.out.println(m.group(1));
    comments.add(m.group(1));
}
With this input:
import Blah;
//Comment one//
/* Blah */
line2(); // something weird
/* Multiline
another line for the comment
*/
It generates this output:
//Comment one//
/* Blah */
// something weird
/* Multiline
another line for the comment
*/
Notice that the last three lines of the output are one single print.
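
To try this without pasting fragments together, the loop can be wrapped in a small self-contained class. This is just a sketch: the class name SimpleCommentRegex and the method findComments are made up for illustration, and the sample input assumes the multi-line comment is closed with */ on its own line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
class SimpleCommentRegex {

    // Either a line comment up to the line break, or a non-greedy block comment.
    // DOTALL lets '.' cross line breaks inside the block comment.
    private static final Pattern COMMENT =
            Pattern.compile("((//[^\n\r]*)|(/\\*(.+?)\\*/))", Pattern.DOTALL);

    static List<String> findComments(String code) {
        List<String> comments = new ArrayList<>();
        Matcher m = COMMENT.matcher(code);
        while (m.find()) {
            comments.add(m.group(1));
        }
        return comments;
    }

    public static void main(String[] args) {
        String code = String.join("\n",
                "import Blah;",
                "//Comment one//",
                "/* Blah */",
                "line2(); // something weird",
                "/* Multiline",
                "another line for the comment",
                "*/");
        for (String comment : findComments(code)) {
            System.out.println(comment);
        }
    }
}
```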
Upvotes: 3
Reputation: 340993
Have you tried regular expressions? Here is a nice wrap-up with a Java example. It might need some tweaking. However, regular expressions alone won't be sufficient for more complicated structures (nested comments, "comments" in strings), but they are a nice start.
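
To make the "comments in strings" caveat concrete, here is a rough sketch of what goes wrong with a plain comment regex. The pattern and the names NaiveCommentRegexDemo and matches are my own assumptions for illustration, not taken from the linked article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of a naive comment regex; names are illustrative only.
class NaiveCommentRegexDemo {

    // A line comment up to the line break, or a non-greedy block comment.
    // The pattern knows nothing about string literals.
    private static final Pattern NAIVE =
            Pattern.compile("//[^\r\n]*|/\\*[\\s\\S]*?\\*/");

    static List<String> matches(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = NAIVE.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The only real comment is "// done"; the URL lives inside a string literal.
        String code = "String url = \"http://example.com\"; // done";
        System.out.println(matches(code));
    }
}
```

Here the match starts at the // inside the URL and swallows the rest of the line, so the real comment is never reported on its own. Avoiding that requires recognizing string and char literals as well, which is where a lexer-style approach comes in.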
Upvotes: 0