swingbit
swingbit

Reputation: 2745

Happy + Alex: How to parse into keyword or identifier based on the context

The very minimal grammar shown here has the sole purpose of illustrating the problem I encountered. No meaning attached.

Example inputs and outputs:

IN:  select TOP [1]
OUT: Select TOP (ConstInt 1)

IN:  select MIN [AVG(2,3)]
OUT: Select MIN (Function "AVG" [ConstInt 2,ConstInt 3])

IN:  select TOP [MIN(2,3)]
OUT: Parse error. Unexpected token TokAssumption "MIN"

The reason for the parse error is that this grammar allows string MIN to be both an "assumption" (the keyword right after select) and a function name. In the last example, it is parsed as an assumption, while it should have been parsed as a function name in that position.

When MIN is encountered, the first lexer rule is the one that reads it as a TokAssumption, so it is not read as a TokIdentifier.

My question is how to handle this. Obviously, the simplest solution would be make sure assumptions and function names cannot overlap. But this is not possible in the actual parser I'm trying it fix.

My intuition is that this should be possible by using a monadic action from the parser, to check the lookahead token. When the lookahead token is a (, it means the current token should be lexed as a TokIdentifier.

Am I looking in the right direction? Is there a better way to handle such cases?

This is the Happy parser for this silly grammar:

{
module Parser (
  parser,
  L.runAlex
) where

import Data
import qualified Lexer as L
}

%name parser algexpr
%tokentype { L.Token }
%error { parseError }
%monad { L.Alex } { >>= } { pure }
%lexer { lexer } { L.EOF }

%token

'('       { L.TokLParen }
')'       { L.TokRParen }
'['       { L.TokLSquare }
']'       { L.TokRSquare }
','       { L.TokComma }
'select'  { L.TokSelect }
int       { L.TokConstInt $$ }
ident     { L.TokIdentifier $$ }
ass       { L.TokAssumption $$ }

%%

commaSep(p)
  : commaSep_(p) { reverse $1 }
commaSep_(p)
  : {- empty -} { [] }
  | p { [$1] }
  | commaSep_(p) ',' p { $3 : $1 }

algexpr :: { AlgExpr }
  : 'select' assumption '[' expr ']' { Select $2 $4 }

assumption :: { Assumption }
  : ass { read $1 }

expr :: { Expr }
  : int { ConstInt $1 }
  | ident '(' commaSep(expr) ')' { Function $1 $3 }

{
parseError :: L.Token -> L.Alex a
parseError tok = do
  (L.AlexPn _ line column, _, _, _) <- L.alexGetInput
  L.alexError $ "Parse error. Unexpected token " <> show tok <> " at line " <> show line <> ", column " <> show column

lexer :: (L.Token -> L.Alex a) -> L.Alex a
lexer = (=<< L.alexMonadScan)
}

This is the Alex lexer:

{
module Lexer (
    Alex
  , AlexPosn (..)
  , alexGetInput
  , alexError
  , runAlex
  , alexMonadScan
  , Token (..)
) where

}

%wrapper "monad"

$digit = 0-9
$alpha = [a-zA-Z]
@ident = $alpha [$alpha $digit _]*

tokens :-

<0> $white+     { skip }

<0> "("       { tok TokLParen }
<0> ")"       { tok TokRParen }
<0> "["       { tok TokLSquare }
<0> "]"       { tok TokRSquare }
<0> ","       { tok TokComma }

<0> select    { tok TokSelect }

<0> MAX       { tokS TokAssumption }
<0> MIN       { tokS TokAssumption }
<0> TOP       { tokS TokAssumption }

<0> $digit+   { tokInt }
<0> @ident    { tokS TokIdentifier }

{
data Token
  = TokLParen
  | TokRParen
  | TokLSquare
  | TokRSquare
  | TokComma
  | TokAssumption String
  | TokSelect
  | TokConstInt Int
  | TokIdentifier String
  | EOF
  deriving (Eq, Show)

alexEOF :: Alex Token
alexEOF = pure EOF

tok :: Token -> AlexAction Token
tok ctor _ _ = pure ctor

tokS :: (String -> Token) -> AlexAction Token
tokS f (_, _, _, str) len = pure $ f $ take len str

tokInt :: AlexAction Token
tokInt (_, _, _, str) len = pure $ TokConstInt $ read $ take len str

}

Data.hs:

module Data( AlgExpr(..), Expr(..), Assumption(..) ) where

data AlgExpr = Select Assumption Expr
             deriving (Eq,Show)

data Expr = ConstInt    Int
          | Function    String [Expr]
          deriving (Eq,Show)

data Assumption = MAX | MIN | TOP
                deriving (Eq,Show,Read)

Upvotes: 2

Views: 93

Answers (0)

Related Questions