Reputation: 2715
For example, given C++ source code like:
#include <iostream>
using namespace std;
int main()
{
    int counta = 0;
    int countb = 0;             // declared but not used below
    while (cin.get() != '*')
        counta++;               // count characters until '*' is read
    cout << counta << " chars";
}
the desired tokens would be something like:
head_iostream
namesp_std
begin_main
var_int
var_int
begin_while,cin_char,var_char!='*'
assign,var_int++
end_while
How can this be done? Are there any related tools or open-source projects available?
Upvotes: 1
Views: 731
Reputation: 95344
Basically what you do is code a finite-state machine (FSM) that processes a stream of input characters from the source file. Each state transitions on individual characters read from the stream.
The FSM start state is "don't know what token comes next". Intermediate states are "according to what characters have been seen since the start state, the current token might be any of the set X,Y,Z... [specific to the state]". Leaf states are "the current token is a complete X" or "the characters since start don't correspond to any valid token". Note that having a "complete X" doesn't mean that more characters cannot be read to extend X or become a Y.
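For instance, here is a minimal sketch (with invented token names) of a state that has already seen a complete token but may still extend it, using '=' versus '==':

#include <iostream>
#include <sstream>

// Illustrative sketch only: having seen '=', the current token is already
// a complete ASSIGN, but one more character may extend it to EQUALS.
// The token names here are invented for the example.
const char* after_equals(std::istream& in) {
    if (in.peek() == '=') {
        in.get();            // consume the second '='
        return "EQUALS";     // the longer token == wins
    }
    return "ASSIGN";         // stop at the complete shorter token =
}

int main() {
    std::istringstream a("=x"), b("==y");
    a.get(); b.get();                       // both streams have consumed one '='
    std::cout << after_equals(a) << '\n';   // prints ASSIGN
    std::cout << after_equals(b) << '\n';   // prints EQUALS
}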
One might handle whitespace and comments as additional branches from the FSM start state, which simply loop back to the start state at the end of the whitespace or comment sequence.
You can code such a finite-state machine for limited character sets (e.g., ASCII) pretty easily by hand; this is often done for compilers, because such FSMs process input streams really fast, compilers see a lot of source-code characters, and a fast lexer helps make a faster compiler.
It is often useful for the various states to store the characters they transition on into a token buffer, so that when a leaf state is reached, the token content has already been captured. One can do this conditionally, based on the current intermediate state; there is no need to store characters if the current state can only lead to a keyword, for example.
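Putting those pieces together, here is a minimal hand-coded sketch (the states and token names are invented for illustration): it recognizes identifiers, integer literals, and single-character operators, loops straight back to the start state on whitespace, and fills a token buffer as it transitions:

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

// Hand-coded FSM sketch; states and token names are invented for
// illustration. Whitespace is a branch that stays in START, and the
// token buffer is filled on each transition through IDENT/NUMBER.
enum class State { START, IDENT, NUMBER };

void lex(std::istream& in) {
    State state = State::START;
    std::string buf;                                  // token buffer
    int c;
    while ((c = in.get()) != EOF || state != State::START) {
        switch (state) {
        case State::START:
            if (std::isspace(c)) break;               // whitespace: stay in START
            buf.assign(1, char(c));
            if (std::isalpha(c) || c == '_') state = State::IDENT;
            else if (std::isdigit(c))        state = State::NUMBER;
            else std::cout << "op(" << char(c) << ")\n";  // leaf: one-char operator
            break;
        case State::IDENT:
            if (c != EOF && (std::isalnum(c) || c == '_')) { buf += char(c); break; }
            std::cout << "ident(" << buf << ")\n";    // leaf: complete identifier
            if (c != EOF) in.unget();                 // re-examine c from START
            state = State::START;
            break;
        case State::NUMBER:
            if (c != EOF && std::isdigit(c)) { buf += char(c); break; }
            std::cout << "number(" << buf << ")\n";   // leaf: complete literal
            if (c != EOF) in.unget();
            state = State::START;
            break;
        }
    }
}

int main() {
    std::istringstream src("counta = counta + 42;");
    lex(src);    // ident(counta) op(=) ident(counta) op(+) number(42) op(;)
}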
You can generate such FSMs using a variety of tools called "lexer generators" (Lex, Flex, and many others; Yacc is the companion parser generator). Normally these tools take a list of pairs of
<token_name, regular_expression>
and build a unified FSM that integrates the effects of all the regular expressions, with a leaf state for each token_name. Such generated lexers can be quite fast, although hand-written lexers can almost always be faster, because people can bring additional knowledge to the table.
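As a rough sketch of that specification style (not how a real generator works internally: a real tool compiles all the patterns into one unified DFA, while this illustration simply tries each regex at the current position and keeps the longest match; the token names and patterns are invented):

#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Invented <token_name, regular_expression> pairs, in priority order
    // for equal-length matches (keywords before identifiers).
    const std::vector<std::pair<std::string, std::regex>> spec = {
        {"WHITESPACE", std::regex(R"(\s+)")},
        {"KEYWORD",    std::regex(R"(int|while|return)")},
        {"IDENT",      std::regex(R"([A-Za-z_]\w*)")},
        {"NUMBER",     std::regex(R"(\d+)")},
        {"OP",         std::regex(R"(\+\+|!=|[=;+*()])")},
    };

    const std::string src = "while (counta != 42) counta++;";
    for (std::size_t pos = 0; pos < src.size(); ) {
        std::string best_name;
        std::size_t best_len = 0;
        for (const auto& [name, re] : spec) {
            std::smatch m;
            // match_continuous anchors the match at the current position
            if (std::regex_search(src.cbegin() + pos, src.cend(), m, re,
                                  std::regex_constants::match_continuous)
                && static_cast<std::size_t>(m.length(0)) > best_len) {
                best_name = name;
                best_len  = m.length(0);
            }
        }
        if (best_len == 0) {                      // no valid token here
            std::cerr << "no valid token at position " << pos << '\n';
            return 1;
        }
        if (best_name != "WHITESPACE")            // skip whitespace tokens
            std::cout << best_name << "(" << src.substr(pos, best_len) << ")\n";
        pos += best_len;
    }
}

Note how the longest-match rule automatically prefers an identifier like "intx" (4 characters via IDENT) over the keyword prefix "int" (3 characters via KEYWORD).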
Machine-generated lexers are much more convenient for languages with large sets of tokens and complex regexes per token (C++11 and COBOL fit this category), or for languages involving Unicode (because the number of transitions from each state can be large and the character categories really messy; thank you, Unicode). With Unicode becoming the preferred character set, machine-generated lexers look like the long-term natural choice. (Surprisingly, many lexer generators don't seem to handle Unicode. [I have one that does.])
Upvotes: 2
Reputation: 15217
I assume what you want is lexical analysis of the code: breaking the text input into tokens according to certain rules. One tool that can do this is Lex.
Upvotes: 0