How to create a regex for tokenizing Java source code in Python

Hello guys, I've been working on an interesting project involving some ML in Python and some Java source code. Basically, I need to tokenize each line of Java code with regular expressions, and sadly I haven't been able to do that.

I've been trying to create my own regular expression pattern for the last couple of days, with lots of googling and YouTube, because I didn't know how to do it myself in the beginning (I don't think I do now either :( ). I tried using tokenizing libraries, but those behave inconsistently, sometimes missing semicolons and brackets and sometimes not.

def stringTokenizer(string):
    tokens = re.findall(r"[\w']+|[""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""\\]", string);
    print(tokens);

stringTokenizer('void addAction(String commandId, IHandler action);');

Initially I wanted to get the following output: ['void', 'addAction', '(', 'String', 'commandId', 'IHandler', 'action', ')', ';'] but sadly this is the closest I got to that result: ['void', 'addAction(', 'String', 'commandId', 'IHandler', 'action);']

If anybody could help, you'd be a lifesaver.

Upvotes: 2

Views: 755

Answers (1)

Wiktor Stribiżew

Reputation: 627082

You want to match either chunks of one or more word or apostrophe characters, or single occurrences of any other character that is not whitespace.

Thus, you need

re.findall(r"[\w']+|[^\w\s']", s)
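
As a minimal sketch, here is that pattern applied to the line from the question. The function name tokenize is only an illustrative choice, not something from the question or the answer.

```python
import re

def tokenize(s):
    # Either a run of word chars/apostrophes, or one single char that is
    # not a word char, not whitespace and not an apostrophe.
    return re.findall(r"[\w']+|[^\w\s']", s)

tokens = tokenize('void addAction(String commandId, IHandler action);')
print(tokens)
# ['void', 'addAction', '(', 'String', 'commandId', ',', 'IHandler', 'action', ')', ';']
```

Note that the comma comes out as its own token. The desired output in the question leaves it out, so filter it afterwards if you really do not want it.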

Consider this expression instead if an apostrophe should be part of a word chunk only when it sits between word chars:

re.findall(r"\w+(?:'\w+)*|[^\w\s]", s)
             ^^^^^^^^^^^^

Details

  • [\w']+ - a positive character class that matches one or more word chars (letters, digits, underscores, and some rarer chars that are considered "word") or apostrophes
  • | - or
  • [^\w\s'] - a negated character class that matches any single char other than word chars, whitespace chars and apostrophes.
  • \w+(?:'\w+)* - matches 1+ word chars followed by 0 or more repetitions of ' and 1+ word chars.
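
To see where the two patterns differ, here is a small comparison on a line that contains apostrophes. The names LOOSE and STRICT and the sample line are made up for illustration.

```python
import re

# LOOSE and STRICT are illustrative labels for the two patterns above.
LOOSE = r"[\w']+|[^\w\s']"
STRICT = r"\w+(?:'\w+)*|[^\w\s]"

line = "char c = 'a'; // don't"

loose_tokens = re.findall(LOOSE, line)
strict_tokens = re.findall(STRICT, line)

print(loose_tokens)
# ['char', 'c', '=', "'a'", ';', '/', '/', "don't"]
print(strict_tokens)
# ['char', 'c', '=', "'", 'a', "'", ';', '/', '/', "don't"]
```

The first pattern glues the quotes of the char literal onto the a, while the second one keeps an apostrophe inside a token only when it sits between word chars and emits the quotes as separate tokens. Both match every non-word char one at a time, so the // comes out as two / tokens.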

Upvotes: 1
