Reputation: 33
I'm taking a course on compilers at my university. I'm choosing to do the project using Haskell + Parsec. The lexer and parser are required to be separate. I'm using Parsec to convert a string into a list of tokens, which will then be passed onto another Parsec parser that converts a list of tokens into an AST.
The problem is that the lexer is supposed to continue attempting to lex, even in the case of an error. To try to do this, I introduced a token representing an "unexpected token" to my Token data type, and I tried to sprinkle my code with <|> unexpected in order to produce this token in case of an error. This is a lot of boilerplate, and also can be difficult to reason about where to place these.
My preferred solution would be to somehow have Parsec automatically do this: If there's ever a ParseError, produce an unexpected token at that position, and continue parsing one position later. How would I do this?
Here is a snippet of some of the code I have now: http://lpaste.net/8144414997276000256 For some reason I can still get a parse error, even though the Unexpected token should be catching the non-handled cases.
Upvotes: 3
Views: 214
Reputation: 68152
It seems like you should be able to get away with having a single extra unexpected
term. I assume you have a token
type that looks something like this:
token' = number
<|> identifier
<|> ...
I'd probably have each token (number
, identifier
... etc) manage its own whitespace:
number :: Parser Token
number = Number . read <$> many1 digit <* spaces
Why not add your extra unexpected term as a catch-all at the end of this?
token' = number
<|> identifier
<|> ...
<|> unexpected'
Have it consume a single character. For better error messages, you could even include the character in the value. Then, when you use this to create a list, you will get an Unexpected
value for each character your lexer doesn't know what to do with.
unexpected' :: Parser Token
unexpected' = Unexpected <$ anyChar
Finally, the whole lex is just a many token'
. In my tests, this deals fine with invalid characters in the middle.
*Main> parse (many token') "<foo>" "1 2 abc ~ ~def"
Right [Number 1,Number 2,Identifier "abc",Unexpected,Unexpected,Unexpected,Identifier "def"]
One thing to keep in mind is that Parsec does not backtrack by default. This means that if a parse fails part of the way through a token, it will not go back and try unexpected
instead: you will just get an error. To enable backtracking, you have to use try
on the parser that might error. For example, if identifier
needs two characters:
identifier :: Parser Token
identifier = Identifier <$> liftA2 (:) letter (many1 alphaNum) <* spaces
Then it might fail part of the way through and not backtrack. But if you wrap it in a try
, it should work:
token' = number
<|> try identifier
<|> ...
The problem with try
is that it can slow your code down if you aren't careful. However, if you don't mind the slowdown, you might be able to get away with just adding try
everywhere and backtracking a lot!
Upvotes: 1