fuz
fuz

Reputation: 92966

Parsing a list of tokens into an expression tree

I want to parse expressions like those in typical Haskell source. I get an input stream, which is already tokenized and annotated with fixity and precedence. The set of operators is not known at compile time and may be arbitrary. The output should be a tree representing the expression. Here's a bit of what I tried:

-- A single token of the input stream
data Token a
  = Prefix a
  | Infix a Int Fixity -- The Int parameter represents the precedence
  | LBrace
  | RBrace
  deriving (Show,Eq)

data Fixity
  = InfixL
  | InfixR
  | InfixC
  deriving (Show,Eq)

data Expression a
  = Atom a
  | Apply a a
  deriving Show

-- Wrapped into either, if expression is malformed.
exprToTree :: [Token a] -> Either String (Expression a)
exprToTree = undefined

For the sake of simpleness, I don't treat lambdas special, they are just atoms.

But now, I am completely lost. How can I convert the stream of atoms into a tree? Can somebody please point me to an algorithm or help me with finding one.

Upvotes: 4

Views: 2644

Answers (2)

fuz
fuz

Reputation: 92966

After searching the web for another topic, I found this nice piece of code to do exactly what I want. Have a look:

data Op     = Op String Prec Fixity          deriving Eq
data Fixity = Leftfix | Rightfix | Nonfix    deriving Eq
data Exp    = Var Var | OpApp Exp Op Exp     deriving Eq
type Prec   = Int
type Var    = String

data Tok = TVar Var | TOp Op

parse :: [Tok] -> Exp
parse (TVar x : rest) = fst (parse1 (Var x) (-1) Nonfix rest)

parse1 :: Exp -> Int -> Fixity -> [Tok] -> (Exp, [Tok])
parse1 e p f [] = (e, [])
parse1 e p f inp@(TOp op@(Op _ p' f') : TVar x : rest) 
  | p' == p && (f /= f' || f == Nonfix)
  = error "ambiguous infix expression"
  | p' < p  ||  p' == p && (f == Leftfix || f' == Nonfix)
  = (e, inp)
  | otherwise
  = let (r,rest') = parse1 (Var x) p' f' rest in
    parse1 (OpApp e op r) p f rest'

-- Printing

instance Show Exp where
  showsPrec _ (Var x) = showString x
  showsPrec p e@(OpApp l (Op op _ _) r) = 
        showParen (p > 0) $ showsPrec 9 l . showString op . showsPrec 9 r

-- Testing

plus   = TOp (Op "+" 6 Leftfix)
times  = TOp (Op "*" 7 Leftfix)
divide = TOp (Op "/" 7 Leftfix)
gt     = TOp (Op ">" 4 Nonfix)
ex     = TOp (Op "^" 8 Rightfix)

lookupop '+' = plus
lookupop '*' = times
lookupop '/' = divide
lookupop '>' = gt
lookupop '^' = ex

fromstr [x]     = [TVar [x]]
fromstr (x:y:z) = TVar [x] : lookupop y : fromstr z

test1 = fromstr "a+b+c"
test2 = fromstr "a+b+c*d"
test3 = fromstr "a/b/c"
test4 = fromstr "a/b+c"
test5 = fromstr "a/b*c"
test6 = fromstr "1^2^3+4"
test7 = fromstr "a/1^2^3"
test8 = fromstr "a*b/c"

(I took it from this page: http://hackage.haskell.org/trac/haskell-prime/attachment/wiki/FixityResolution/resolve.hs)

Upvotes: 0

stephen tetley
stephen tetley

Reputation: 4493

In a nutshell then, even though you have a token list you still need a parser.

Parsec can handle alternative token streams, but you'll probably have to refer to the manual - a PDF available at Daan Leijen's "legacy" home page - http://legacy.cs.uu.nl/daan/download/parsec/parsec.pdf. You can roll your own parser without using a combinator library but you will be re-implementing some fraction of Parsec. As far as I remember UU_parsing expects to work with a separate scanner so its another option.

Although it doesn't handle parsing you might find Lennart Augustsson's "Lambda calculus cooked four ways" helpful for other things - http://www.augustsson.net/Darcs/Lambda/top.pdf

Edit - here is a partly worked out plan of how you can go about it with Parsec, for details you'll have to consult section 2.11 of the manual.

Suppose you have this data type for concrete "internal" tokens:

data InternalTok = Ident String
                 | BinOpPlus
                 | BinOpMinus
                 | UnaryOpNegate
                 | IntLiteral Int
  deriving (Show)

Then you end get these types for the Parsec token and parse:

type MyToken = Token InternalTok

type MyParser a = GenParser MyToken () a

Define a helper function as per the Parsec manual - this handles show and pos so individual definitions are shorter cf. the mytoken function on page 19.

mytoken :: (MyToken -> Maybe a) -> MyParser a
mytoken test = token showToken posToken testToken
  where
    showToken tok = show tok
    posToken tok = no_pos
    testToken tok = test tok

For the moment your token type does not track source position, so:

no_pos :: SourcePos
no_pos = newPos "" 0 0 0 

For each terminal you have to define a token function:

identifier :: MyParser MyToken
identifier =  mytoken (\tok -> case tok of
                         a@(Prefix (Ident _)) -> Just a
                         _                    -> Nothing)

intLiteral :: MyParser MyToken
intLiteral =  mytoken (\tok -> case tok of
                         a@(Prefix (IntLiteral _)) -> Just a
                         _                         -> Nothing)

binPlus :: MyParser MyToken
binPlus =  mytoken (\tok -> case tok of
                      a@(Infix BinOpPlus _ _) -> Just a
                      _                       -> Nothing)


binMinus :: MyParser MyToken
binMinus =  mytoken (\tok -> case tok of
                      a@(Infix BinOpMinus _ _) -> Just a
                      _                        -> Nothing)

unaryNegate :: MyParser MyToken
unaryNegate =  mytoken (\tok -> case tok of
                        a@(Prefix UnaryNegate _ _) -> Just a
                        _                          -> Nothing)

Edit - to handle custom infix operators you'll need these token parsers:

tokInfixL :: Int -> MyParser MyToken
tokInfixL n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixL) | i == n -> Just a
    _                             -> Nothing)


tokInfixR :: Int -> MyParser MyToken
tokInfixR n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixR) | i == n -> Just a
    _                             -> Nothing)

tokInfixC :: Int -> MyParser MyToken
tokInfixC n = mytoken $ \tok -> case tok of
    a@(Infix _ i InfixC) | i == n -> Just a
    _                             -> Nothing)


tokPrefix :: MyParser MyToken
tokPrefix = mytoken (\tok -> case tok of
                       a@(Prefix _) -> Just a
                       _            -> Nothing)

Now you can define the parser - you need to fix the number of levels of precedence beforehand, there is no way around that fact as you need to code a parser for each level.

The top-level expression parse is simply calls the highest precedence parser

pExpression :: Parser Expersion
pExpression = expression10

For each precendence level you need a parser roughly like this, you'll have to work out non-assoc for yourself. Also you might need to do some work on chainl / chainr - I've only written a parser in this style with UU_Parsing it might be slightly different for Parsec. Note Apply is usually at the precedence highest level.

expression10 :: Parser Expression
expression10 = 
        Apply   <$> identifier <*> pExpression
    <|> Prefix  <$> tokPrefix  <*> pExpression
    <|> chainl (Infix <$> tokInfixL 10) expression9
    <|> chainr (Infix <$> tokInfixR 10) expression9

expression9 :: Parser Expression
expression9 = 
        Prefix  <$> tokPrefix  <*> pExpression
    <|> chainl (Infix <$> tokInfixL 9) expression8
    <|> chainr (Infix <$> tokInfixR 9) expression8

...

You'll have to extend your syntax to handle IntLiterals and Identifiers which are at level 0 in precedence:

expression0 :: Parser Expression
expression0 = 
        IntLit  <$> intLiteral 
    <|> Ident   <$> identifier
    <|> ...

Edit - for unlimited precedence - maybe if you only have application and Atom maybe something like this would work. Note you'll have to change the tokInfixL and tokInfixR parsers to no longer match assoc-level and you may have to experiment with the order of alternatives.

expression :: Parser Expression
expression = 
        Apply   <$> identifier <*> expression
    <|> Prefix  <$> tokPrefix  <*> expression
    <|> chainl (Infix <$> tokInfixL) expression
    <|> chainr (Infix <$> tokInfixR) expression
    <|> intLiteral
    <|> identifier

intLiteral :: Parser Expression
intLiteral = Atom . convert <$> intLiteral
  where
    convert = ??

identifier :: Parser Expression
identifier = Atom . convert <$> intLiteral
  where
    convert = ??

Upvotes: 5

Related Questions