HaskellProgrammer
HaskellProgrammer

Reputation: 31

How do I parse infix instead of prefix with Haskell?

I need help with this program I am trying to write in Haskell. I have written most of it, and here is what I am basically trying to do:

  1. When I write

parse "a + b"

in the terminal I want this as output:

Plus (Word "a") (Word "b")

  1. When I write

parse "a - 2 * b + c"

in the terminal I want this as output:

Minus (Word "a") (Plus (Mult (Num 2) (Word "b")) (Word "c"))

My code so far:

data Ast
    = Word String
    | Num Int
    | Mult Ast Ast
    | Plus Ast Ast
    | Minus Ast Ast
    deriving (Eq, Show)

tokenize :: [Char] -> [String]
tokenize [] = []
tokenize (' ' : s) = tokenize s
tokenize ('+' : s) = "+" : tokenize s
tokenize ('*' : s) = "*" : tokenize s
tokenize (c : s)
  | isDigit c =
    let (cs, s') = collectWhile isDigit s
     in (c : cs) : tokenize s'
  | isAlpha c =
    let (cs, s') = collectWhile isAlpha s
     in (c : cs) : tokenize s'
  | otherwise = error ("unexpected character " ++ show c)

collectWhile :: (Char -> Bool) -> String -> (String, String)
collectWhile p s = (takeWhile p s, dropWhile p s)

isDigit, isAlpha :: Char -> Bool
isDigit c = c `elem` ['0' .. '9']
isAlpha c = c `elem` ['a' .. 'z'] ++ ['A' .. 'Z']

parseU :: [String] -> (Ast, [String])
parseU ("+" : s0) =
  let (e1, s1) = parseU s0
      (e2, s2) = parseU s1
   in (Plus e1 e2, s2)
parseU ("*" : s0) =
  let (e1, s1) = parseU s0
      (e2, s2) = parseU s1
   in (Mult e1 e2, s2)
parseU (t : ts)
  | isNumToken t = (Num (read t), ts)
  | isWordToken t = (Word t, ts)
  | otherwise = error ("unrecognized token " ++ show t)
parseU [] = error "unexpected end of input"

isNumToken, isWordToken :: String -> Bool
isNumToken xs = takeWhile isDigit xs == xs
isWordToken xs = takeWhile isAlpha xs == xs

parse :: String -> Ast
parse s =
  case parseU (tokenize s) of
    (e, []) -> e
    (_, t : _) -> error ("unexpected token " ++ show t)

inn :: Ast -> String
inn (Plus x y) = innP x ++ " + " ++ innP y
inn (Mult x y) = innP x ++ " * " ++ innP y
inn ast = innP ast

innP :: Ast -> String
innP (Num n) = show n
innP (Plus x y) = "(" ++ innP x ++ " + " ++ innP y ++ ")"
innP (Mult x y) = "(" ++ innP x ++ " * " ++ innP y ++ ")"
innP (Word w) = w -- 

innfiks :: String -> String
innfiks s = inn (parse s)

Right now I get an error posting the text I wrote in the terminal, but when I write it like this:

parse "+ a b"

I get the correct output:

Plus (Word "a") (Word "b")

I know I have to change the code so it accepts what I send to the parse function on this form:

value operator value,

and not on this form:

operator value value

But im struggling to find out what and where I have to do this change.

Upvotes: 3

Views: 993

Answers (1)

K. A. Buhr
K. A. Buhr

Reputation: 51039

To handle infix operators with precedence, one approach is to introduce a sequence of parsing functions corresponding to the precedence levels. So, if you have "factors" that can be multiplied to create "terms" which can be added or subtracted to create "expressions", you'll want to create parser functions for each of these levels. Parsing a "factor" (i.e., a word or number) is easy, since you've already written the code:

parseFactor :: [String] -> (Ast, [String])
parseFactor (t : ts)
  | isNumToken t = (Num (read t), ts)
  | isWordToken t = (Word t, ts)
  | otherwise = error ("unrecognized token " ++ show t)
parseFactor [] = error "unexpected end of input"

Parsing a term is trickier. You want to start by parsing a factor and then, optionally, a * followed by another factor, and then treat that as a term to be further optionally multiplied by another factor, and so on. The following is one way to do it:

parseTerm :: [String] -> (Ast, [String])
parseTerm ts
  =  let (f1, ts1) = parseFactor ts     -- parse first factor
     in  go f1 ts1
  where go acc ("*":ts2)                -- add a factor to an accumulating term
          = let (f2, ts3) = parseFactor ts2
            in go (Mult acc f2) ts3
        go acc rest = (acc, rest)       -- no more factors: return the term

If you want, try writing a similar parseExpr to parse terms separated by + characters (skipping subtraction for now), and test it on something like:

parseExpr (tokenize "2 + 3 * 6 + 4 * 8 * 12 + 1")

For spoilers, here's a version that handles both + and -, though note that your tokenizer doesn't yet handle subtraction correctly, so you'll have to fix that first.

parseExpr :: [String] -> (Ast, [String])
parseExpr ts
  =  let (f1, ts1) = parseTerm ts
     in  go f1 ts1
  where go acc (op:ts2)
          | op == "+" || op == "-"
          = let (f2, ts3) = parseTerm ts2
            in go ((astOp op) acc f2) ts3
        go acc rest = (acc, rest)
        astOp "+" = Plus
        astOp "-" = Minus

With that in place, you can point parse to the right parser:

parse :: String -> Ast
parse s =
  case parseExpr (tokenize s) of
    (e, []) -> e
    (_, t : _) -> error ("unexpected token " ++ show t)

and your example should work:

λ> parse "a - 2 * b + c"
Plus (Minus (Word "a") (Mult (Num 2) (Word "b"))) (Word "c")

Note that this is slightly different than the output you said you wanted, but this ordering is correct for left-associative operators (which is important for correctly handling -). That is, you want:

5 - 4 + 1

to parse as:

(5 - 4) + 1  -- i.e., (Plus (Minus (Num 5) (Num 4)) (Num 1))

so that an evaluator will calculate the correct answer of 2. If you parse it as:

5 - (4 + 1)  -- i.e., (Minus (Num 5) (Plus (Num 4) (Num 1)))

your evaluator would calculate the wrong answer of 0.

However, if you really want to parse with right-associative operators, see below.

The full modified code for left-associative operators:

data Ast
    = Word String
    | Num Int
    | Mult Ast Ast
    | Plus Ast Ast
    | Minus Ast Ast
    deriving (Eq, Show)

tokenize :: [Char] -> [String]
tokenize [] = []
tokenize (' ' : s) = tokenize s
tokenize ('-' : s) = "-" : tokenize s
tokenize ('+' : s) = "+" : tokenize s
tokenize ('*' : s) = "*" : tokenize s
tokenize (c : s)
  | isDigit c =
    let (cs, s') = collectWhile isDigit s
     in (c : cs) : tokenize s'
  | isAlpha c =
    let (cs, s') = collectWhile isAlpha s
     in (c : cs) : tokenize s'
  | otherwise = error ("unexpected character " ++ show c)

collectWhile :: (Char -> Bool) -> String -> (String, String)
collectWhile p s = (takeWhile p s, dropWhile p s)

isDigit, isAlpha :: Char -> Bool
isDigit c = c `elem` ['0' .. '9']
isAlpha c = c `elem` ['a' .. 'z'] ++ ['A' .. 'Z']

parseFactor :: [String] -> (Ast, [String])
parseFactor (t : ts)
  | isNumToken t = (Num (read t), ts)
  | isWordToken t = (Word t, ts)
  | otherwise = error ("unrecognized token " ++ show t)
parseFactor [] = error "unexpected end of input"

parseTerm :: [String] -> (Ast, [String])
parseTerm ts
  =  let (f1, ts1) = parseFactor ts
     in  go f1 ts1
  where go acc ("*":ts2)
          = let (f2, ts3) = parseFactor ts2
            in go (Mult acc f2) ts3
        go acc rest = (acc, rest)

parseExpr :: [String] -> (Ast, [String])
parseExpr ts
  =  let (f1, ts1) = parseTerm ts
     in  go f1 ts1
  where go acc (op:ts2)
          | op == "+" || op == "-"
          = let (f2, ts3) = parseTerm ts2
            in go ((astOp op) acc f2) ts3
        go acc rest = (acc, rest)
        astOp "+" = Plus
        astOp "-" = Minus

isNumToken, isWordToken :: String -> Bool
isNumToken xs = takeWhile isDigit xs == xs
isWordToken xs = takeWhile isAlpha xs == xs

parse :: String -> Ast
parse s =
  case parseExpr (tokenize s) of
    (e, []) -> e
    (_, t : _) -> error ("unexpected token " ++ show t)

For right-associative operators, modify these definitions:

parseTerm :: [String] -> (Ast, [String])
parseTerm ts
  =  let (fct, ts1) = parseFactor ts
     in  case ts1 of
           "*":ts2 -> let (trm, rest) = parseTerm ts2
                      in  (Mult fct trm, rest)
           _       -> (fct, ts1)

parseExpr :: [String] -> (Ast, [String])
parseExpr ts
  =  let (trm, ts1) = parseTerm ts
     in  case ts1 of
           op:ts2 | op == "+" || op == "-"
                   -> let (expr, rest) = parseExpr ts2
                      in  (astOp op trm expr, rest)
           _       -> (trm, ts1)
  where astOp "+" = Plus
        astOp "-" = Minus*

Upvotes: 3

Related Questions