Reputation: 11
I've been trying to make a tokenizer for a language as an exercise. For example, I'm trying to tokenize the code below:
num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}
and I've been trying to use this regex:
re.findall("[\'\w\-]+",text)
but I get output like this: vecsum(vec
while I want to get this: ["vecsum", "(", "vec"]
I want it to split on characters like ";" and "(" even when there is no whitespace around them.
Upvotes: 0
Views: 3015
Reputation: 1122132
Tokenizing a C-like language requires more work than just splitting on whitespace (which is what you are doing right now).
There are at least 3 types of tokens in such languages:
- multi-character tokens, such as keywords, identifiers and numbers
- single-character tokens, such as (, ), ;, + and =
- strings, which run between quote characters
I'm ignoring comments here, which are either defined as running from a starting sequence until the end of a line (# ..., // ..., etc.) or from a starting sequence to an ending sequence any number of lines away (/* ... */).
You could define a regex that can tokenize the first two types, and then perhaps use the output of that to handle strings: when you get a " token, find the next " token without a \ token right in front of it, then take everything in between (whitespace and all) as the string.
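A rough sketch of that quote-pairing pass; the tokenize helper is illustrative only (it assumes every opening quote has a matching close, and its escape check is crude):

```python
import re

# single-character tokens, or runs of word characters
TOKEN = re.compile(r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+)')

def tokenize(text):
    """Tokenize text, merging everything between double quotes into one token."""
    tokens = list(TOKEN.finditer(text))
    result = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.group() == '"':
            # scan ahead for the closing quote; a quote whose previous token
            # is a backslash is treated as escaped (crude, but illustrates the idea)
            j = i + 1
            while j < len(tokens) and not (
                tokens[j].group() == '"' and tokens[j - 1].group() != '\\'
            ):
                j += 1
            # take the raw source text between (and including) the two quotes,
            # whitespace and all
            result.append(text[tok.start():tokens[j].end()])
            i = j + 1
        else:
            result.append(tok.group())
            i += 1
    return result
```

Whitespace inside the quotes survives because the slice is taken from the original string, not rebuilt from the tokens.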
Such a tokenizer then needs at least two groups: one for single-character tokens and one for multi-character tokens. The multi-character tokens form a further group of options:
r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'
I used the operators in C and C++ page on Wikipedia as a guide for which single-character tokens to look for.
For your sample input, this produces:
['num', 'vecsum', '(', 'vec', 'A', ')', '{', 'num', 'n', ';', 'n', '=', '(', '5', '+', '2', ')', ';', 'return', 'n', ';', '}']
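As a quick check, here is that pattern run over the sample input with re.findall() (a minimal sketch):

```python
import re

# the two-part pattern from above: single-character tokens,
# or runs of word characters (names, keywords and numbers)
TOKEN = re.compile(r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))')

source = """\
num vecsum(vec A)
{
num n;
n = (5 + 2);
return n;
}
"""

print(TOKEN.findall(source))
```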
If you must parse multi-symbol operators as single tokens, then you also must include these as separate patterns in the regex, with longer operators placed before their shorter prefixes (regex alternation takes the first branch that matches, so <<= must come before <<), e.g.
(
r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
r'|\*=|/=|%=|&=|\^=|\|='
r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|(?:\w+|\d+))'
)
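A quick check of that extended pattern (a sketch; the escaping of the += branch and the ordering of <<= before << are corrections of mine):

```python
import re

# multi-character operators first, longest before their prefixes ('<<=' before '<<'),
# then single-character tokens, then runs of word characters
TOKEN = re.compile(
    r'(?:<<=|>>=|==|!=|>=|<=|&&|\|\||<<|>>|->|::|\+\+|--|\+=|-='
    r'|\*=|/=|%=|&=|\^=|\|='
    r'|[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+)'
)

print(TOKEN.findall('x <<= 2; y == z || a != b'))
```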
but then you are half-way to a full-blown tokenizer that defines patterns for each type of literal and keyword, and you may as well start breaking this huge regex out into its constituent parts. See the Python tokenize module source code for an example of such a tokenizer; it builds up a large regex from component parts to produce typed tokens.
The alternative is to stick with the super-simple 2-part tokenizer regex and use re.finditer() to make decisions about tokens in context. With a start and end position in the string, you can detect that a = was directly preceded by another =, and so know that you have the == comparison operator rather than two assignments. I used this approach before in a simple parser for a SQLite full-text search query language (look for the _terms_from_query() method in the code for this answer), if you want to see an example.
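That context-based approach can be sketched like this; merge_equals is a hypothetical helper that only handles == for brevity:

```python
import re

# the simple 2-part pattern: single-character tokens or runs of word characters
TOKEN = re.compile(r'(?:[\\(){}[\]=&|^+<>/*%;.\'"?!~-]|\w+)')

def merge_equals(text):
    """Merge two adjacent '=' tokens into a single '==' token."""
    result = []
    prev_end = None
    for match in TOKEN.finditer(text):
        tok = match.group()
        # an '=' that starts exactly where a previous '=' ended is the second
        # half of '==' -- merge it into the previous token
        if tok == '=' and result and result[-1] == '=' and match.start() == prev_end:
            result[-1] = '=='
        else:
            result.append(tok)
        prev_end = match.end()
    return result
```

The adjacency test via match.start() and match.end() is what distinguishes a == b (merged) from a = = b (two separate tokens).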
Upvotes: 2