vash_ace

Reputation: 529

Regex to fetch the correct Java class name

I want to scan the content of a Java file (e.g. TagMatchingInterface.java) and extract the class name (TagMatchingInterface) with a regex, but my regex matches the wrong name because keywords (class/interface/enum) also appear in the comment:

/**
 *
 * @author XXXX
 * Introduction: A common interface that judges all kinds of algorithm tags.
 * some other comment
 */
public class TagMatchingInterface 
{
  // content
  public class InnerClazz{
    // content
  }
}

Here is my pattern:

public Pattern CLASS_PATTERN = Pattern.compile("(?:public\\s)?(?:.*\\s)?(class|interface|enum)\\s+([$_a-zA-Z][$_a-zA-Z0-9]*)");
....
Matcher matcher = CLASS_PATTERN.matcher(content);
if (matcher.find()) {
   System.out.println(matcher.group(2));
}

Any ideas on how to fix my regex?

Upvotes: 5

Views: 3077

Answers (2)

MvG

Reputation: 60958

… there are some key words(class/interface/enum) hiding in the comment:

Then get rid of all comments first. Suitable regular expressions should be fairly easy to write. I suggest you eliminate both kinds of comments (single line and multiple lines) simultaneously, in case there is text in one that looks like the start of the other.

Also get rid of all strings, while you are at it, since you might have a string in an annotation preceding the class.
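A minimal sketch of that idea, assuming plain Java source without text blocks: a single alternation removes block comments, line comments, string literals and char literals in one pass, so whichever construct starts first is the one that gets consumed. The names ClassNameFinder, NOISE and findClassName are hypothetical, chosen only for this illustration, and the declaration pattern is a simplified ASCII-only version of the one in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClassNameFinder {
    // One alternation, scanned left to right, so a "//" inside a string
    // or a quote inside a comment is swallowed by the construct around it.
    private static final Pattern NOISE = Pattern.compile(
        "/\\*[\\s\\S]*?\\*/"               // block and Javadoc comments
        + "|//[^\\n]*"                     // line comments
        + "|\"(?:\\\\.|[^\"\\\\\\n])*\""   // string literals
        + "|'(?:\\\\.|[^'\\\\\\n])*'"      // char literals
    );

    // Simplified declaration pattern: keyword, whitespace, ASCII identifier.
    private static final Pattern DECLARATION = Pattern.compile(
        "\\b(class|interface|enum)\\s+([$_a-zA-Z][$_a-zA-Z0-9]*)");

    /** Returns the first declared type name, or null if none is found. */
    public static String findClassName(String source) {
        // Replace with a space so neighbouring tokens do not merge.
        String stripped = NOISE.matcher(source).replaceAll(" ");
        Matcher m = DECLARATION.matcher(stripped);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String source =
            "/**\n"
            + " * Introduction: A common interface that judges all kinds of algorithm tags.\n"
            + " */\n"
            + "public class TagMatchingInterface \n"
            + "{\n"
            + "  // content\n"
            + "  public class InnerClazz{\n"
            + "  }\n"
            + "}\n";
        System.out.println(findClassName(source)); // prints TagMatchingInterface
    }
}
```

This only returns the first declaration in the file, which is the top-level type in the sample from the question. It does not try to decide whether that type is public.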

"(?:public\\s)?(?:.*\\s)?(class|interface|enum)\\s+([$_a-zA-Z][$_a-zA-Z0-9]*)"

Checking for public serves little purpose, since the part without it would match just as well. In fact only the later parts will match if one of the class modifiers like final or abstract follows public.

So if you want to know whether the class is in fact public, you have to check for these, too. That will be tricky, since you may have annotations with parenthesized arguments nested to arbitrary depth, which is something a regular expression can't handle correctly.

What about classes containing non-ASCII letters in their name? What about Unicode escapes in the input?
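For the non-ASCII part, one option is to let java.util.regex delegate to Character.isJavaIdentifierStart and Character.isJavaIdentifierPart through the \p{javaJavaIdentifierStart} and \p{javaJavaIdentifierPart} properties. The sketch below assumes comments and strings have already been removed, and it does not decode Unicode escapes that appear in the scanned text. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeClassNamePattern {
    // The identifier part defers to java.lang.Character instead of
    // hard-coding the ASCII ranges used in the question.
    static final Pattern DECLARATION = Pattern.compile(
        "\\b(class|interface|enum)\\s+"
        + "(\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*)");

    /** Returns the first declared type name, or null if none is found. */
    static String firstName(String source) {
        Matcher m = DECLARATION.matcher(source);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // A name with non-ASCII letters, written with escapes so that
        // this example file itself stays pure ASCII.
        String source = "public final class Gr\u00f6\u00dfe {}";
        System.out.println(firstName(source));
    }
}
```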

Upvotes: 1

Ro Yo Mi

Reputation: 15010

Description

(?<=\n|\A)(?:public\s)?(class|interface|enum)\s([^\n\s]*)

This regex does the following:

  • allow the string to start with public or not
  • be a class or interface or enum
  • capture the name

Note: I recommend using the global and case-insensitive flags.

Example

Live Example

https://regex101.com/r/vR0iK3/1

Sample Text

/**
 *
 * @author XXXX
 * Introduction: A common interface that judges all kinds of algorithm tags.
 * some other comment
 */
public class TagMatchingInterface 
{
  // content
  public class InnerClazz{
    // content
  }
}

Sample Matches

[0][0] = public class TagMatchingInterface
[0][1] = class
[0][2] = TagMatchingInterface

Capture groups:

  • group 0 gets the entire match
  • group 1 gets the keyword (class, interface or enum)
  • group 2 gets the name
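In Java, the same expression could be used roughly as follows. This is only a sketch: the class and method names are hypothetical, calling find() in a loop stands in for the global flag, and the case-insensitive flag is left out because Java keywords are lowercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineAnchoredClassPattern {
    // Same regex as above, with backslashes doubled for a Java string.
    static final Pattern TYPE_AT_LINE_START = Pattern.compile(
        "(?<=\\n|\\A)(?:public\\s)?(class|interface|enum)\\s([^\\n\\s]*)");

    /** Collects group 2 (the name) of every match in the source. */
    static List<String> findNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = TYPE_AT_LINE_START.matcher(source);
        while (m.find()) {
            names.add(m.group(2));
        }
        return names;
    }

    public static void main(String[] args) {
        String source =
            "/**\n"
            + " * Introduction: A common interface that judges all kinds of algorithm tags.\n"
            + " */\n"
            + "public class TagMatchingInterface \n"
            + "{\n"
            + "  // content\n"
            + "  public class InnerClazz{\n"
            + "    // content\n"
            + "  }\n"
            + "}\n";
        System.out.println(findNames(source)); // prints [TagMatchingInterface]
    }
}
```

Because the match has to begin right after a newline or at the start of the input, the keyword inside the comment and the indented inner class are both skipped. The flip side is that an indented declaration, or one with another modifier such as final or abstract in front of the keyword, is not matched either.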

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?<=                     look behind to see if there is:
----------------------------------------------------------------------
    \n                       '\n' (newline)
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    \A                        Start of the string
----------------------------------------------------------------------
  )                        end of look-behind
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    public                   'public'
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    class                    'class'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    interface                'interface'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    enum                     'enum'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [^\n\s]*                 any character except: '\n' (newline),
                             whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------

Upvotes: 7
