Reputation: 115
Hello guys, I've been working on an interesting project involving some ML in Python and some Java source code. Basically, I need to tokenize each line of Java code with regular expressions, and sadly I haven't been able to do that.
I've been trying to create my own regular expression pattern for the last couple of days with lots of googling and youtubing, because I didn't know how to do it myself in the beginning (I don't think I do now either :( ). I tried using libraries for tokenizing, but those work in really weird ways, like sometimes missing semicolons and brackets and sometimes not.
import re

def stringTokenizer(string):
    tokens = re.findall(r"[\w']+|[""!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~""\\]", string)
    print(tokens)

stringTokenizer('void addAction(String commandId, IHandler action);')
Initially I wanted to get the following output: ['void', 'addAction', '(', 'String', 'commandId', 'IHandler', 'action', ')', ';'] but sadly this is the closest I got to the result: ['void', 'addAction(', 'String', 'commandId', 'IHandler', 'action);']
If anybody could help, you'd be a lifesaver.
Upvotes: 2
Views: 755
Reputation: 627082
You want to match chunks of one or more word or apostrophe chars, or single occurrences of any other char except whitespace.
Thus, you need
re.findall(r"[\w']+|[^\w\s']", s)
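As a quick check, here is a minimal sketch that applies this pattern to the line from the question. The function name `tokenize` is only illustrative and not part of the original post.

```python
import re

def tokenize(line):
    # Either a run of word chars/apostrophes, or any single char that is
    # not a word char, whitespace, or an apostrophe.
    return re.findall(r"[\w']+|[^\w\s']", line)

tokens = tokenize('void addAction(String commandId, IHandler action);')
print(tokens)
# ['void', 'addAction', '(', 'String', 'commandId', ',', 'IHandler', 'action', ')', ';']
```

Note that the comma comes out as its own token. If it is not wanted, it can be filtered out afterwards, e.g. `[t for t in tokens if t != ',']`.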
You might also consider this expression if a ' should only be kept inside a word chunk when it sits between word chars:
re.findall(r"\w+(?:'\w+)*|[^\w\s]", s)
             ^^^^^^^^^^^^
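To see where the two patterns differ, here is a small sketch that runs both on a line containing a Java character literal and an apostrophe inside a word. The sample string and the variable names are made up for illustration.

```python
import re

# Pattern 1: apostrophes are treated like word chars.
WORDS_OR_QUOTES = r"[\w']+|[^\w\s']"
# Pattern 2: an apostrophe joins a chunk only between word chars.
WORDS_WITH_INNER_QUOTES = r"\w+(?:'\w+)*|[^\w\s]"

sample = "char c = 'a'; // don't"

first = re.findall(WORDS_OR_QUOTES, sample)
second = re.findall(WORDS_WITH_INNER_QUOTES, sample)

print(first)
# ['char', 'c', '=', "'a'", ';', '/', '/', "don't"]
print(second)
# ['char', 'c', '=', "'", 'a', "'", ';', '/', '/', "don't"]
```

With the first pattern the quotes of the character literal stay glued to the `a`; with the second they become separate tokens, while `don't` stays in one piece in both cases.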
Details
[\w']+
- a positive character class that matches one or more word chars (letters, digits, underscores, and some rarer chars that are considered "word" chars) or apostrophes
|
- or
[^\w\s']
- a negated character class that matches any 1 char other than word chars, whitespace chars and single apostrophes.
\w+(?:'\w+)*
- matches 1+ word chars followed by 0 or more repetitions of ' and 1+ word chars.
Upvotes: 1