danbst

Reputation: 3613

How to parse C-style comments with Alex lexer?

NB. I'm using this Alex template from Simon Marlow.

I'd like to write a lexer for C-style comments. My current approach emits separate tokens for the start, body, and end of a block comment, plus one for single-line comments:

%wrapper "monad"

tokens :-
  <0> $white+ ;
  <0> "/*"               { mkL LCommentStart `andBegin` comment }
  <comment> .            { mkL LComment }
  <comment> "*/"         { mkL LCommentEnd `andBegin` 0 }
  <0> "//" .*$           { mkL LSingleLineComment }

data LexemeClass
  = LEOF
  | LCommentStart
  | LComment
  | LCommentEnd
  | LSingleLineComment

Upvotes: 2

Views: 1643

Answers (1)

ErikR

Reputation: 52029

Have a look at this:

http://lpaste.net/107377

Test with something like:

echo "This /* is a */ test" | ./c_comment

which should print:

Right [W "This",CommentStart,CommentBody " is a ",CommentEnd,W "test"]

The key alex routines you need to use are:

alexGetInput -- gets the current input state
alexSetInput -- sets the current input state
alexGetByte  -- returns the next byte and input state
andBegin     -- return a token and set the current start code

Each of the routines commentBegin, commentEnd and commentBody have the following signature:

AlexInput -> Int -> Alex Lexeme

where Lexeme stands for your token type. The AlexInput parameter has the form (for the monad wrapper):

(AlexPosn, Char, [Byte], String)

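The Int parameter is the length of the match. The matched text is the first len characters of the String field, which holds the input starting at the beginning of the match. Therefore the form of most token handlers will be: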

handler :: AlexInput -> Int -> Alex Lexeme
handler (pos,_,_,inp) len = ... do something with (take len inp) and pos ...

In general it seems that a handler can ignore the Char and [Byte] fields.
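To make the handler shape concrete, here is a minimal sketch that runs without Alex. The AlexPosn and AlexInput definitions are stand-ins copied from what the monad wrapper normally generates (in a real lexer they come from the generated .hs file), and the Lexeme type and wordToken name are hypothetical. The real handler would return its result in the Alex monad; this version is pure so it can be tried on its own.

```haskell
import Data.Word (Word8)

-- Stand-ins for the types Alex generates with %wrapper "monad".
-- In a real lexer these are already defined in the generated module.
data AlexPosn = AlexPn !Int !Int !Int deriving (Show, Eq)
type Byte = Word8
type AlexInput = (AlexPosn, Char, [Byte], String)

-- Hypothetical token type for illustration.
data Lexeme = W AlexPosn String deriving (Show, Eq)

-- Pure version of a handler: keep the position and the first `len`
-- characters of the remaining input, ignore the Char and [Byte] fields.
wordToken :: AlexInput -> Int -> Lexeme
wordToken (pos, _, _, inp) len = W pos (take len inp)

main :: IO ()
main = print (wordToken (AlexPn 0 1 1, '\n', [], "This /* is a */ test") 4)
-- prints: W (AlexPn 0 1 1) "This"
```

In an actual Alex specification the same body would be wrapped with return so that it has type AlexInput -> Int -> Alex Lexeme.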

The handlers commentBegin and commentEnd can ignore both the AlexInput and Int arguments because they just match fixed length strings.

The commentBody handler works by calling alexGetByte repeatedly, accumulating the comment body until "*/" is found. C comments do not nest, so the comment ends at the first occurrence of "*/".

Note that the first character of the comment body is in the match0 variable. In fact, my code has a bug in it since it will not match "/**/" correctly. It should look at match0 to decide whether to start at loop or loopStar.
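The two-state idea behind loop and loopStar can be sketched on a plain String. This is not the code from the paste: next is a hypothetical stand-in with roughly the shape of alexGetByte (next element plus remaining input), and scanBody is an illustrative name. Because the scan starts right after "/*" rather than one character later, an empty comment such as "/**/" is handled as well.

```haskell
-- Stand-in for alexGetByte: the next character and the remaining input.
next :: String -> Maybe (Char, String)
next []       = Nothing
next (c : cs) = Just (c, cs)

-- Scan the text following "/*". Returns (body, input after "*/"),
-- or Nothing if the comment is never closed.
scanBody :: String -> Maybe (String, String)
scanBody = loop []
  where
    -- loop: the previous character was not '*'
    loop acc inp = case next inp of
      Nothing       -> Nothing
      Just ('*', r) -> loopStar acc r
      Just (c, r)   -> loop (c : acc) r
    -- loopStar: the previous character was '*' and is not yet in acc
    loopStar acc inp = case next inp of
      Nothing       -> Nothing
      Just ('/', r) -> Just (reverse acc, r)
      Just ('*', r) -> loopStar ('*' : acc) r
      Just (c, r)   -> loop (c : '*' : acc) r

main :: IO ()
main = do
  print (scanBody " is a */ test")  -- Just (" is a "," test")
  print (scanBody "*/")             -- Just ("","")
  print (scanBody "a*b**/x")        -- Just ("a*b*","x")
  print (scanBody "never closed")   -- Nothing
```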

You can use the same technique to parse "//" style comments - or any token where a non-greedy match is required.

Another key point is that patterns like $white+ are qualified with a start code:

<0>$white+

This is done so that they are not active while processing comments.

You can use another wrapper, but note that the structure of the AlexInput type may be different -- e.g. for the basic wrapper it is just a 3-tuple: (Char,[Byte],String). Just look at the definition of AlexInput in the generated .hs file.

A final note... accumulating characters using ++ is, of course, rather inefficient. You probably want to use Text (or ByteString) for the accumulator.
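As a rough sketch of that last point, the accumulator can be a Builder from Data.Text.Lazy.Builder (part of the text package), which supports cheap appends. The function name commentBody here is only illustrative and the input is a plain String rather than AlexInput, so this shows the accumulation strategy and nothing else.

```haskell
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- Scan to the first "*/", appending each character to a Builder
-- instead of using (++) on a String.
commentBody :: String -> Maybe (TL.Text, String)
commentBody = go mempty
  where
    go acc ('*' : '/' : rest) = Just (B.toLazyText acc, rest)
    go acc (c : rest)         = go (acc <> B.singleton c) rest
    go _   []                 = Nothing

main :: IO ()
main = print (commentBody " is a */ test")
-- prints: Just (" is a "," test")
```

Consing onto a list and reversing once at the end is another common way to avoid repeated (++), if a String result is all that is needed.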

Upvotes: 2
