Reputation: 529
For some reason, I want to scan the content of a Java file (e.g. TagMatchingInterface.java) and fetch the class name (TagMatchingInterface) via regex, but my regex matches the wrong class name because there are keywords (class/interface/enum) hiding in the comment:
/**
*
* @author XXXX
* Introduction: A common interface that judges all kinds of algorithm tags.
* some other comment
*/
public class TagMatchingInterface
{
// content
public class InnerClazz{
// content
}
}
Here is my pattern:
public Pattern CLASS_PATTERN = Pattern.compile("(?:public\\s)?(?:.*\\s)?(class|interface|enum)\\s+([$_a-zA-Z][$_a-zA-Z0-9]*)");
....
Matcher matcher = CLASS_PATTERN.matcher(content);
if (matcher.find()) {
System.out.println(matcher.group(2));
}
Any idea about my regex?
Upvotes: 5
Views: 3077
Reputation: 60958
… there are some key words(class/interface/enum) hiding in the comment:
Then get rid of all comments first. Suitable regular expressions should be fairly easy to write. I suggest you eliminate both kinds of comments (single line and multiple lines) simultaneously, in case there is text in one that looks like the start of the other.
Also get rid of all strings, while you are at it, since you might have a string in an annotation preceding the class.
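A minimal sketch of that idea, not a full Java lexer: a single alternation removes both comment styles plus string and char literals in one pass, so whichever construct starts first wins. The names TypeNameFinder, NOISE, stripNoise and firstTypeName are made up for illustration, and the sketch assumes the source has no text blocks (""") and no Unicode escapes that spell out comment or quote characters.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TypeNameFinder {
    // One alternation, scanned left to right: a "//" inside a string literal is
    // consumed as part of the string, and a quote inside a comment is consumed
    // as part of the comment.
    private static final Pattern NOISE = Pattern.compile(
        "//[^\\r\\n]*"                          // single-line comment
        + "|/\\*[\\s\\S]*?\\*/"                 // block or Javadoc comment
        + "|\"(?:\\\\.|[^\"\\\\\\r\\n])*\""     // string literal
        + "|'(?:\\\\.|[^'\\\\\\r\\n])+'");      // char literal

    // Same ASCII-only identifier rule as in the question.
    private static final Pattern TYPE_DECL = Pattern.compile(
        "\\b(class|interface|enum)\\s+([$_a-zA-Z][$_a-zA-Z0-9]*)");

    // Replaces every comment and literal with a single space.
    static String stripNoise(String source) {
        return NOISE.matcher(source).replaceAll(" ");
    }

    // Returns the name of the first type declared in the cleaned source.
    static Optional<String> firstTypeName(String source) {
        Matcher m = TYPE_DECL.matcher(stripNoise(source));
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String source = String.join("\n",
            "/**",
            " * Introduction: A common interface that judges all kinds of algorithm tags.",
            " */",
            "public class TagMatchingInterface",
            "{",
            "}");
        System.out.println(firstTypeName(source).orElse("<none>"));
    }
}
```

Replacing each match with a space rather than an empty string keeps the tokens on either side of a removed comment from running together.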
"(?:public\\s)?(?:.*\\s)?(class|interface|enum)\\s+([$_a-zA-Z][$_a-zA-Z0-9]*)"
Checking for public serves little purpose, since the part without it would match just as well. In fact only the later parts will match if one of the class modifiers like final or abstract follows public.
So if you want to know whether the class is in fact public, you have to check for these, too. Which will be tricky since you may have annotations with parenthesized arguments nested to arbitrary depth. That's something a regular expression can't handle correctly.
What about classes containing non-ASCII letters in their name? What about Unicode escapes in the input?
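For the non-ASCII part, one possible approach is to let java.util.regex defer to Character.isJavaIdentifierStart and Character.isJavaIdentifierPart through the \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart} properties. The class name UnicodeTypeName is hypothetical, and this sketch does not decode Unicode escapes that appear as literal text in the scanned file; those would need a separate pre-processing step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeTypeName {
    // \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart} accept exactly
    // the characters the corresponding java.lang.Character methods accept.
    private static final Pattern TYPE_DECL = Pattern.compile(
        "\\b(class|interface|enum)\\s+"
        + "(\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*)");

    // Returns the first declared type name, or null if none is found.
    static String firstTypeName(String source) {
        Matcher m = TYPE_DECL.matcher(source);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // The escapes below are decoded by the Java compiler, so the scanned
        // text really contains the non-ASCII letters.
        System.out.println(firstTypeName("public class Gr\u00f6\u00dfe {}"));
    }
}
```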
Upvotes: 1
Reputation: 15010
(?<=\n|\A)(?:public\s)?(class|interface|enum)\s([^\n\s]*)
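This regex does the following:
requires the match to start at the beginning of the string or right after a newline
matches public, or not
captures class or interface or enum
captures the name that follows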
Note, I recommend using the global and case insensitive flags
Live Example
https://regex101.com/r/vR0iK3/1
Sample Text
/**
*
* @author XXXX
* Introduction: A common interface that judges all kinds of algorithm tags.
* some other comment
*/
public class TagMatchingInterface
{
// content
public class InnerClazz{
// content
}
}
Sample Matches
[0][0] = public class TagMatchingInterface
[0][1] = class
[0][2] = TagMatchingInterface
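In Java the same expression might be used roughly as below. This is a sketch: the class and method names are made up, the "global" flag is emulated with a find() loop, and the case-insensitive flag is left out here because Java keywords are lowercase. The inner class is indented in this sample, so the look-behind rejects it; an unindented declaration such as "public class InnerClazz{" would match too, and group 2 would then include the trailing brace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineStartTypeMatcher {
    // The regex from this answer, with backslashes doubled for a Java string.
    static final Pattern TYPE_AT_LINE_START = Pattern.compile(
        "(?<=\\n|\\A)(?:public\\s)?(class|interface|enum)\\s([^\\n\\s]*)");

    // Collects group 2 (the name) of every match, like a "global" search.
    static List<String> typeNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = TYPE_AT_LINE_START.matcher(source);
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
            "/**",
            " * Introduction: A common interface that judges all kinds of algorithm tags.",
            " */",
            "public class TagMatchingInterface",
            "{",
            "    public class InnerClazz{",
            "    }",
            "}");
        System.out.println(typeNames(source)); // prints [TagMatchingInterface]
    }
}
```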
Capture groups:
NODE                     EXPLANATION
----------------------------------------------------------------------
  (?<=                     look behind to see if there is:
----------------------------------------------------------------------
    \n                       '\n' (newline)
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    \A                       Start of the string
----------------------------------------------------------------------
  )                        end of look-behind
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    public                   'public'
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    class                    'class'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    interface                'interface'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    enum                     'enum'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [^\n\s]*                 any character except: '\n' (newline),
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
Upvotes: 7