Reputation: 92966
I want to parse expressions like those in typical Haskell source. I get an input stream, which is already tokenized and annotated with fixity and precedence. The set of operators is not known at compile time and may be arbitrary. The output should be a tree representing the expression. Here's a bit of what I tried:
-- A single token of the input stream
data Token a
= Prefix a
| Infix a Int Fixity -- The Int parameter represents the precedence
| LBrace
| RBrace
deriving (Show,Eq)
data Fixity
= InfixL
| InfixR
| InfixC
deriving (Show,Eq)
data Expression a
= Atom a
| Apply a a
deriving Show
-- Wrapped into either, if expression is malformed.
exprToTree :: [Token a] -> Either String (Expression a)
exprToTree = undefined
For the sake of simpleness, I don't treat lambdas special, they are just atoms.
But now, I am completely lost. How can I convert the stream of atoms into a tree? Can somebody please point me to an algorithm or help me with finding one.
Upvotes: 4
Views: 2644
Reputation: 92966
After searching the web for another topic, I found this nice piece of code to do exactly what I want. Have a look:
data Op = Op String Prec Fixity deriving Eq
data Fixity = Leftfix | Rightfix | Nonfix deriving Eq
data Exp = Var Var | OpApp Exp Op Exp deriving Eq
type Prec = Int
type Var = String
data Tok = TVar Var | TOp Op
parse :: [Tok] -> Exp
parse (TVar x : rest) = fst (parse1 (Var x) (-1) Nonfix rest)
parse1 :: Exp -> Int -> Fixity -> [Tok] -> (Exp, [Tok])
parse1 e p f [] = (e, [])
parse1 e p f inp@(TOp op@(Op _ p' f') : TVar x : rest)
| p' == p && (f /= f' || f == Nonfix)
= error "ambiguous infix expression"
| p' < p || p' == p && (f == Leftfix || f' == Nonfix)
= (e, inp)
| otherwise
= let (r,rest') = parse1 (Var x) p' f' rest in
parse1 (OpApp e op r) p f rest'
-- Printing
instance Show Exp where
showsPrec _ (Var x) = showString x
showsPrec p e@(OpApp l (Op op _ _) r) =
showParen (p > 0) $ showsPrec 9 l . showString op . showsPrec 9 r
-- Testing
plus = TOp (Op "+" 6 Leftfix)
times = TOp (Op "*" 7 Leftfix)
divide = TOp (Op "/" 7 Leftfix)
gt = TOp (Op ">" 4 Nonfix)
ex = TOp (Op "^" 8 Rightfix)
lookupop '+' = plus
lookupop '*' = times
lookupop '/' = divide
lookupop '>' = gt
lookupop '^' = ex
fromstr [x] = [TVar [x]]
fromstr (x:y:z) = TVar [x] : lookupop y : fromstr z
test1 = fromstr "a+b+c"
test2 = fromstr "a+b+c*d"
test3 = fromstr "a/b/c"
test4 = fromstr "a/b+c"
test5 = fromstr "a/b*c"
test6 = fromstr "1^2^3+4"
test7 = fromstr "a/1^2^3"
test8 = fromstr "a*b/c"
(I took it from this page: http://hackage.haskell.org/trac/haskell-prime/attachment/wiki/FixityResolution/resolve.hs)
Upvotes: 0
Reputation: 4493
In a nutshell then, even though you have a token list you still need a parser.
Parsec can handle alternative token streams, but you'll probably have to refer to the manual - a PDF available at Daan Leijen's "legacy" home page - http://legacy.cs.uu.nl/daan/download/parsec/parsec.pdf. You can roll your own parser without using a combinator library but you will be re-implementing some fraction of Parsec. As far as I remember UU_parsing expects to work with a separate scanner so its another option.
Although it doesn't handle parsing you might find Lennart Augustsson's "Lambda calculus cooked four ways" helpful for other things - http://www.augustsson.net/Darcs/Lambda/top.pdf
Edit - here is a partly worked out plan of how you can go about it with Parsec, for details you'll have to consult section 2.11 of the manual.
Suppose you have this data type for concrete "internal" tokens:
data InternalTok = Ident String
| BinOpPlus
| BinOpMinus
| UnaryOpNegate
| IntLiteral Int
deriving (Show)
Then you end get these types for the Parsec token and parse:
type MyToken = Token InternalTok
type MyParser a = GenParser MyToken () a
Define a helper function as per the Parsec manual - this handles show and pos so individual definitions are shorter cf. the mytoken
function on page 19.
mytoken :: (MyToken -> Maybe a) -> MyParser a
mytoken test = token showToken posToken testToken
where
showToken tok = show tok
posToken tok = no_pos
testToken tok = test tok
For the moment your token type does not track source position, so:
no_pos :: SourcePos
no_pos = newPos "" 0 0 0
For each terminal you have to define a token function:
identifier :: MyParser MyToken
identifier = mytoken (\tok -> case tok of
a@(Prefix (Ident _)) -> Just a
_ -> Nothing)
intLiteral :: MyParser MyToken
intLiteral = mytoken (\tok -> case tok of
a@(Prefix (IntLiteral _)) -> Just a
_ -> Nothing)
binPlus :: MyParser MyToken
binPlus = mytoken (\tok -> case tok of
a@(Infix BinOpPlus _ _) -> Just a
_ -> Nothing)
binMinus :: MyParser MyToken
binMinus = mytoken (\tok -> case tok of
a@(Infix BinOpMinus _ _) -> Just a
_ -> Nothing)
unaryNegate :: MyParser MyToken
unaryNegate = mytoken (\tok -> case tok of
a@(Prefix UnaryNegate _ _) -> Just a
_ -> Nothing)
Edit - to handle custom infix operators you'll need these token parsers:
tokInfixL :: Int -> MyParser MyToken
tokInfixL n = mytoken $ \tok -> case tok of
a@(Infix _ i InfixL) | i == n -> Just a
_ -> Nothing)
tokInfixR :: Int -> MyParser MyToken
tokInfixR n = mytoken $ \tok -> case tok of
a@(Infix _ i InfixR) | i == n -> Just a
_ -> Nothing)
tokInfixC :: Int -> MyParser MyToken
tokInfixC n = mytoken $ \tok -> case tok of
a@(Infix _ i InfixC) | i == n -> Just a
_ -> Nothing)
tokPrefix :: MyParser MyToken
tokPrefix = mytoken (\tok -> case tok of
a@(Prefix _) -> Just a
_ -> Nothing)
Now you can define the parser - you need to fix the number of levels of precedence beforehand, there is no way around that fact as you need to code a parser for each level.
The top-level expression parse is simply calls the highest precedence parser
pExpression :: Parser Expersion
pExpression = expression10
For each precendence level you need a parser roughly like this, you'll have to work out non-assoc for yourself. Also you might need to do some work on chainl / chainr - I've only written a parser in this style with UU_Parsing it might be slightly different for Parsec. Note Apply is usually at the precedence highest level.
expression10 :: Parser Expression
expression10 =
Apply <$> identifier <*> pExpression
<|> Prefix <$> tokPrefix <*> pExpression
<|> chainl (Infix <$> tokInfixL 10) expression9
<|> chainr (Infix <$> tokInfixR 10) expression9
expression9 :: Parser Expression
expression9 =
Prefix <$> tokPrefix <*> pExpression
<|> chainl (Infix <$> tokInfixL 9) expression8
<|> chainr (Infix <$> tokInfixR 9) expression8
...
You'll have to extend your syntax to handle IntLiterals and Identifiers which are at level 0 in precedence:
expression0 :: Parser Expression
expression0 =
IntLit <$> intLiteral
<|> Ident <$> identifier
<|> ...
Edit - for unlimited precedence - maybe if you only have application and Atom maybe something like this would work. Note you'll have to change the tokInfixL and tokInfixR parsers to no longer match assoc-level and you may have to experiment with the order of alternatives.
expression :: Parser Expression
expression =
Apply <$> identifier <*> expression
<|> Prefix <$> tokPrefix <*> expression
<|> chainl (Infix <$> tokInfixL) expression
<|> chainr (Infix <$> tokInfixR) expression
<|> intLiteral
<|> identifier
intLiteral :: Parser Expression
intLiteral = Atom . convert <$> intLiteral
where
convert = ??
identifier :: Parser Expression
identifier = Atom . convert <$> intLiteral
where
convert = ??
Upvotes: 5