Parsing a list of tokens into an expression tree

Question

I want to parse expressions like those in typical Haskell source. I get an input stream, which is already tokenized and annotated with fixity and precedence. The set of operators is not known at compile time and may be arbitrary. The output should be a tree representing the expression. Here's a bit of what I tried:

-- A single token of the input stream
data Token a
  = Prefix a
  | Infix a Int Fixity -- The Int parameter represents the precedence
  | LBrace
  | RBrace
  deriving (Show,Eq)

data Fixity
  = InfixL
  | InfixR
  | InfixC
  deriving (Show,Eq)

data Expression a
  = Atom a
  | Apply a a
  deriving Show

-- Wrapped into either, if expression is malformed.
exprToTree :: [Token a] -> Either String (Expression a)
exprToTree = undefined

For the sake of simpleness, I don't treat lambdas special, they are just atoms.

But now, I am completely lost. How can I convert the stream of atoms into a tree? Can somebody please point me to an algorithm or help me with finding one.

stephen tetley · Accepted Answer

In a nutshell then, even though you have a token list you still need a parser.

Parsec can handle alternative token streams, but you'll probably have to refer to the manual - a PDF available at Daan Leijen's "legacy" home page - http://legacy.cs.uu.nl/daan/download/parsec/parsec.pdf. You can roll your own parser without using a combinator library but you will be re-implementing some fraction of Parsec. As far as I remember UU_parsing expects to work with a separate scanner so its another option.

Although it doesn't handle parsing you might find Lennart Augustsson's "Lambda calculus cooked four ways" helpful for other things - http://www.augustsson.net/Darcs/Lambda/top.pdf

Edit - here is a partly worked out plan of how you can go about it with Parsec, for details you'll have to consult section 2.11 of the manual.

Suppose you have this data type for concrete "internal" tokens:

data InternalTok = Ident String
                 | BinOpPlus
                 | BinOpMinus
                 | UnaryOpNegate
                 | IntLiteral Int
  deriving (Show)

Then you end get these types for the Parsec token and parse:

type MyToken = Token InternalTok

type MyParser a = GenParser MyToken () a

Define a helper function as per the Parsec manual - this handles show and pos so individual definitions are shorter cf. the mytoken function on page 19.

mytoken :: (MyToken -> Maybe a) -> MyParser a
mytoken test = token showToken posToken testToken
  where
    showToken tok = show tok
    posToken tok = no_pos
    testToken tok = test tok

For the moment your token type does not track source position, so:

no_pos :: SourcePos
no_pos = newPos "" 0 0 0

For each terminal you have to define a token function:

identifier :: MyParser MyToken
identifier =  mytoken (	ok -> case tok of
                         a@(Prefix (Ident _)) -> Just a
                         _                    -> Nothing)

intLiteral :: MyParser MyToken
intLiteral =  mytoken (	ok -> case tok of
                         a@(Prefix (IntLiteral _)) -> Just a
                         _                         -> Nothing)

binPlus :: MyParser MyToken
binPlus =  mytoken (	ok -> case tok of
                      a@(Infix BinOpPlus _ _) -> Just a
                      _                       -> Nothing)


binMinus :: MyParser MyToken
binMinus =  mytoken (	ok -> case tok of
                      a@(Infix BinOpMinus _ _) -> Just a
                      _                        -> Nothing)

unaryNegate :: MyParser MyToken
unaryNegate =  mytoken (	ok -> case tok of
                        a@(Prefix UnaryNegate _ _) -> Just a
                        _                          -> Nothing)

Edit - to handle custom infix operators you'll need these token parsers:

tokInfixL :: Int -> MyParser MyToken
tokInfixL n = mytoken $ 	ok -> case tok of
    a@(Infix _ i InfixL) | i == n -> Just a
    _                             -> Nothing)


tokInfixR :: Int -> MyParser MyToken
tokInfixR n = mytoken $ 	ok -> case tok of
    a@(Infix _ i InfixR) | i == n -> Just a
    _                             -> Nothing)

tokInfixC :: Int -> MyParser MyToken
tokInfixC n = mytoken $ 	ok -> case tok of
    a@(Infix _ i InfixC) | i == n -> Just a
    _                             -> Nothing)


tokPrefix :: MyParser MyToken
tokPrefix = mytoken (	ok -> case tok of
                       a@(Prefix _) -> Just a
                       _            -> Nothing)

Now you can define the parser - you need to fix the number of levels of precedence beforehand, there is no way around that fact as you need to code a parser for each level.

The top-level expression parse is simply calls the highest precedence parser

pExpression :: Parser Expersion
pExpression = expression10

For each precendence level you need a parser roughly like this, you'll have to work out non-assoc for yourself. Also you might need to do some work on chainl / chainr - I've only written a parser in this style with UU_Parsing it might be slightly different for Parsec. Note Apply is usually at the precedence highest level.

expression10 :: Parser Expression
expression10 = 
        Apply   <$> identifier <*> pExpression
    <|> Prefix  <$> tokPrefix  <*> pExpression
    <|> chainl (Infix <$> tokInfixL 10) expression9
    <|> chainr (Infix <$> tokInfixR 10) expression9

expression9 :: Parser Expression
expression9 = 
        Prefix  <$> tokPrefix  <*> pExpression
    <|> chainl (Infix <$> tokInfixL 9) expression8
    <|> chainr (Infix <$> tokInfixR 9) expression8

...

You'll have to extend your syntax to handle IntLiterals and Identifiers which are at level 0 in precedence:

expression0 :: Parser Expression
expression0 = 
        IntLit  <$> intLiteral 
    <|> Ident   <$> identifier
    <|> ...

Edit - for unlimited precedence - maybe if you only have application and Atom maybe something like this would work. Note you'll have to change the tokInfixL and tokInfixR parsers to no longer match assoc-level and you may have to experiment with the order of alternatives.

expression :: Parser Expression
expression = 
        Apply   <$> identifier <*> expression
    <|> Prefix  <$> tokPrefix  <*> expression
    <|> chainl (Infix <$> tokInfixL) expression
    <|> chainr (Infix <$> tokInfixR) expression
    <|> intLiteral
    <|> identifier

intLiteral :: Parser Expression
intLiteral = Atom . convert <$> intLiteral
  where
    convert = ??

identifier :: Parser Expression
identifier = Atom . convert <$> intLiteral
  where
    convert = ??

Parsing a list of tokens into an expression tree

Answers (2)

Related Questions