Mahdi Talebi

Reputation: 11

regex - making a tokenizer using regular expressions in Python 3

I've been trying to make a tokenizer for a language as an exercise. For example, I'm trying to tokenize the code below:

num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}

and I've been trying to use this regex

re.findall("[\'\w\-]+", text)

but I get outputs like this one: vecsum(vec

while I want to get it like this: ["vecsum", "(", "vec"]

I want it to split off tokens like ";" and "(" even when there is no whitespace around them.

Upvotes: 0

Views: 3015

Answers (1)

Martijn Pieters

Reputation: 1122132

Tokenizing a C-like language requires more work than just splitting on whitespace (which is what you are doing right now).

There are at least 3 types of tokens in such languages:

  • Single character tokens, such as (, ), ;, + and =
  • Multi-character tokens: identifiers, keywords, and numbers.
  • Strings: everything between an opening quote and the matching closing quote (with support for escapes, and varying degrees of support for special kinds of strings that could, say, contain newlines).

I'm ignoring comments here, which are either defined as running from a starting sequence until the end of a line (# ..., // ..., etc.) or from a starting sequence to an ending sequence any number of lines away (/* .... */).
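
If you did want to drop comments before tokenizing, a minimal sketch is a separate substitution pass; the pattern and names below are my own, and it assumes comment markers never appear inside string literals:

import re

# Strip // ... line comments and /* ... */ block comments; re.DOTALL
# lets the block-comment pattern span newlines.
COMMENT_RE = re.compile(r'//[^\n]*|/\*.*?\*/', re.DOTALL)

code = 'num n; // a counter\n/* multi\nline */ n = 1;'
print(COMMENT_RE.sub(' ', code))   # comments are replaced by a single space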

You could define a regex that tokenizes the first two types, and then perhaps use its output to handle strings: when you get a " token, find the next " token that is not directly preceded by a \ token, and take everything in between (whitespace and all) as the string.
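
Rather than post-processing tokens, you could also fold strings into the regex itself as an extra alternative; a minimal escape-aware sketch of my own:

import re

# Match a double-quoted string as one token; \\. consumes any escaped
# character (including \") so an escaped quote does not end the string.
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"')

m = STRING_RE.search(r'x = "a \"quoted\" word";')
print(m.group())   # "a \"quoted\" word"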

Such a tokenizer then needs at least two groups, one for single-character tokens and one for multi-character tokens. The multi-character tokens are a further group of options:

r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'

I used operators in C and C++ on Wikipedia as a guide on what single-character tokens to look for.
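
As a quick sanity check, you can compile that pattern and run it over the question's snippet with re.findall() (the variable names here are just for illustration):

import re

TOKEN_RE = re.compile(r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))')

text = '''num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}'''

print(TOKEN_RE.findall(text))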

For your sample input, this produces:

['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';', 'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']

If you must parse multi-symbol operators as single tokens, then you also must include these as separate alternatives in the regex, listing longer operators before their shorter prefixes (so <<= is tried before <<), e.g.

(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
    r'|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'
)
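
A quick test of the operator ordering (the input line here is made up):

import re

MULTI_RE = re.compile(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
    r'|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'
)

print(MULTI_RE.findall('if (a == b) n += 1;'))
# ['if', '(', 'a', '==', 'b', ')', 'n', '+=', '1', ';']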

but then you are halfway to a full-blown tokenizer that defines patterns for each type of literal and keyword, and you may as well start breaking this huge regex out into its constituent parts. See the Python tokenize module source code for an example of such a tokenizer; it builds up a large regex from component parts to produce typed tokens.
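
In miniature, that idea looks like this; named groups give each match a type, and the component patterns below are simplified stand-ins of my own, not the module's:

import re

# Component patterns combined into one regex; m.lastgroup names the
# alternative that matched, which serves as the token type.
NUMBER = r'\d+'
NAME = r'[A-Za-z_]\w*'
OP = r'[\\(){}[\]=&|^+<>/*%;.?!~-]'
TYPED_RE = re.compile(f'(?P<NUMBER>{NUMBER})|(?P<NAME>{NAME})|(?P<OP>{OP})')

for m in TYPED_RE.finditer('n = (5 + 2);'):
    print(m.lastgroup, m.group())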

The alternative is to stick with the super-simple two-part tokenizer regex and use re.finditer() to make decisions about tokens in context. With each match's start and end position in the string, you can detect that a = was directly preceded by another =, and so know you have the == comparison operator rather than two assignments. I used this approach before in a simple parser for a SQLite full-text search query language; look for the _terms_from_query() method in the code for that answer if you want to see an example.
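
A minimal sketch of that idea (the names and input are mine):

import re

SIMPLE_RE = re.compile(r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+)')

def tokens(text):
    matches = list(SIMPLE_RE.finditer(text))
    i = 0
    while i < len(matches):
        m = matches[i]
        # Two adjacent '=' matches with no gap between them form one '=='.
        if (m.group() == '=' and i + 1 < len(matches)
                and matches[i + 1].group() == '='
                and matches[i + 1].start() == m.end()):
            yield '=='
            i += 2
        else:
            yield m.group()
            i += 1

print(list(tokens('a == b; c = d;')))
# ['a', '==', 'b', ';', 'c', '=', 'd', ';']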

Upvotes: 2
