Reputation: 3613
NB. I'm using this Alex template from Simon Marlow.
I'd like to create a lexer for C-style comments. My current approach creates separate tokens for the start, the end and the middle of a comment, and for single-line comments:
%wrapper "monad"
tokens :-
<0> $white+ ;
<0> "/*" { mkL LCommentStart `andBegin` comment }
<comment> . { mkL LComment }
<comment> "*/" { mkL LCommentEnd `andBegin` 0 }
<0> "//" .*$ { mkL LSingleLineComment }
data LexemeClass
= LEOF
| LCommentStart
| LComment
| LCommentEnd
| LSingleLineComment
For input such as /*blabla*/ I will get 8 tokens instead of one!
How can I strip the // part from the single-line comment token? Is it possible to lex comments without the monad wrapper?
Upvotes: 2
Views: 1643
Reputation: 52029
Have a look at this:
Test with something like:
echo "This /* is a */ test" | ./c_comment
which should print:
Right [W "This",CommentStart,CommentBody " is a ",CommentEnd,W "test"]
The key alex routines you need to use are:
alexGetInput -- gets the current input state
alexSetInput -- sets the current input state
alexGetByte -- returns the next byte and input state
andBegin -- return a token and set the current start code
Each of the routines commentBegin, commentEnd and commentBody has the following signature:
AlexInput -> Int -> Alex Lexeme
where Lexeme stands for your token type. The AlexInput parameter has the form (for the monad wrapper):
(AlexPosn, Char, [Byte], String)
The Int parameter is the length of the match; the matched text is the first that many characters of the String field. Therefore the form of most token handlers will be:
handler :: AlexInput -> Int -> Alex Lexeme
handler (pos,_,_,inp) len = ... do something with (take len inp) and pos ...
In general it seems that a handler can ignore the Char and [Byte] fields.
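As a rough illustration of that handler shape, here is a standalone sketch. The types AlexPosn, Byte, AlexInput and Alex below are local stand-ins that only mimic what the monad wrapper generates (in a real lexer they come from the generated .hs file), and Lexeme and wordHandler are hypothetical names.

```haskell
import Data.Word (Word8)

-- Stand-ins for what Alex's monad wrapper generates; assumptions for this sketch.
data AlexPosn = AlexPn !Int !Int !Int   -- absolute offset, line, column
  deriving (Eq, Show)

type Byte = Word8
type AlexInput = (AlexPosn, Char, [Byte], String)
type Alex = Either String               -- placeholder for the real Alex monad

-- Hypothetical token type.
data Lexeme = W AlexPosn String
  deriving (Eq, Show)

-- The matched text is the first `len` characters of the remaining input.
wordHandler :: AlexInput -> Int -> Alex Lexeme
wordHandler (pos, _, _, inp) len = return (W pos (take len inp))
```

The Char and [Byte] fields are pattern-matched with wildcards, which is all most handlers need to do with them.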
The handlers commentBegin and commentEnd can ignore both the AlexInput and Int arguments because they just match fixed length strings.
The commentBody handler works by calling alexGetByte to accumulate the comment body until "*/" is found. As far as I know C comments may not be nested so the comment ends at the first occurrence of "*/".
Note that the first character of the comment body is in the match0 variable. In fact, my code has a bug in it since it will not match "/**/" correctly. It should look at match0 to decide whether to start at loop or loopStar.
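The scanning loop, including the fix for "/**/", can be sketched in plain Haskell. This is a standalone model that works on a String rather than the actual handler: scanBody is a hypothetical name, and in the real lexer the characters would come from alexGetByte and the final position would be stored with alexSetInput. The names loop and loopStar mirror the ones mentioned above.

```haskell
-- Given the first body character (match0) and the input after it,
-- return the comment body and the input that follows the closing "*/".
-- Nothing means the input ended before "*/" was seen.
scanBody :: Char -> String -> Maybe (String, String)
scanBody match0 rest
  | match0 == '*' = loopStar "" rest       -- a leading '*' may already start "*/"
  | otherwise     = loop [match0] rest
  where
    -- acc holds the body in reverse, so each step is a cheap cons.
    loop _   []       = Nothing
    loop acc ('*':cs) = loopStar acc cs
    loop acc (c:cs)   = loop (c : acc) cs

    -- The previous character was a '*' that is not yet part of acc.
    loopStar _   []       = Nothing
    loopStar acc ('/':cs) = Just (reverse acc, cs)
    loopStar acc ('*':cs) = loopStar ('*' : acc) cs   -- keep the earlier '*', stay pending
    loopStar acc (c:cs)   = loop (c : '*' : acc) cs
```

With this starting rule, scanBody '*' "/" yields an empty body, which is the "/**/" case.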
You can use the same technique to parse "//" style comments - or any token where a non-greedy match is required.
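For "//" comments the scan reduces to reading up to the end of the line. The sketch below is standalone and uses hypothetical helper names. The second helper assumes the rule matched the whole comment, so that the match length covers the leading "//" as well.

```haskell
-- Split the input that follows "//" into the comment text and the rest.
-- The newline, if any, stays in the rest so a whitespace rule can consume it.
scanLineComment :: String -> (String, String)
scanLineComment = break (== '\n')

-- If the rule already matched the whole comment, the text without the
-- leading "//" can be taken from the remaining input and the match length.
lineCommentText :: String -> Int -> String
lineCommentText inp len = drop 2 (take len inp)
```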
Another key point is that patterns like $white+ are qualified with a start code:
<0>$white+
This is done so that they are not active while processing comments.
You can use another wrapper, but note that the structure of the AlexInput type may be different; for example, for the basic wrapper it is just a 3-tuple: (Char,[Byte],String). Just look at the definition of AlexInput in the generated .hs file.
A final note... accumulating characters using ++ is, of course, rather inefficient. You probably want to use Text (or ByteString) for the accumulator.
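One possible way to avoid repeated ++ is sketched below, using Data.Text.Lazy.Builder from the text package. This is only one option among several, and collectBody is a hypothetical name. Appending to a Builder is cheap, and the strict Text is produced once at the end.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- Accumulate characters up to (not including) the first "*/".
-- Returns the body as strict Text and the input after the terminator,
-- or Nothing if the terminator never appears.
collectBody :: String -> Maybe (T.Text, String)
collectBody = go mempty
  where
    go _   []             = Nothing
    go acc ('*':'/':rest) = Just (TL.toStrict (B.toLazyText acc), rest)
    go acc (c:cs)         = go (acc <> B.singleton c) cs
```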
Upvotes: 2