MathematicalOrchid

Reputation: 62818

Struggling with Applicative parsing

So I'm having a go at writing a complex parser using only Applicative (the parser in question doesn't implement Monad at all).

For trivial parsers, this is quite easy. For non-trivial ones... not so much. The applicative interface seems to violently force you to write everything in point-free style. This is extremely difficult to deal with.

Consider, for example:

call = do
  n <- name
  char '('
  trim
  as <- sepBy argument (char ',' >> trim)
  char ')'
  trim
  char '='
  r <- result
  return $ Call {name = n, args = as, result = r}

Now let's try to write that using applicative:

call =
  (\ n _ _ as _ _ _ r -> Call {name = n, args = as, result = r}) <$>
  name <*>
  char '(' <*>
  trim <*>
  sepBy argument (const (const ()) <$> char ',' <*> trim) <*>
  char ')' <*>
  trim <*>
  char '=' <*>
  result

Applicative has forced me to put the variable bindings very far away from the actual parsers they correspond to. (E.g., try to confirm that as is actually bound to the result of sepBy argument ...; it's not at all easy to verify that I haven't got the wrong number of _ patterns!)

Another very unintuitive thing is that <*> applies a function to a value, but *> and <* are just pure sequencing. It took me ages to wrap my mind around this. Different method names would have made it far, far clearer. (But Monad has already grabbed >>, sadly.) It turns out that these operators can be chained, yielding things like

exit =
  "EXIT:" *>
  trim *>
  name <*
  char '(' <*
  trim <*
  char ')' <*
  trim

It's rather non-obvious that you can do this. And, to me, this code really isn't terribly readable. More importantly, I still haven't figured out how you deal with collecting multiple values while dropping multiple other values.
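For concreteness, the behaviour of the three operators can be checked in any Applicative; a small sketch using Maybe as a stand-in for the parser type (an assumption for illustration only):

```haskell
-- Maybe stands in for the parser type; any Applicative behaves analogously.
main :: IO ()
main = do
  print (Just (+ 1) <*> Just 2)  -- applies the function: Just 3
  print (Just 'a' *> Just 'b')   -- keeps the right value: Just 'b'
  print (Just 'a' <* Just 'b')   -- keeps the left value:  Just 'a'
  -- mixing <*> and <* collects some values while dropping others:
  print ((,) <$> Just 1 <* Just 2 <*> Just 3)  -- Just (1,3)
```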

In all, I find myself wishing I could just use do-notation! I don't actually need to change effects based on prior results; I don't need the power of Monad. But the notation is so much more readable. (I keep wondering whether it would actually be feasible to implement this; can you syntactically tell when a particular do-block can be mechanically transformed to applicative?)

Does anybody know of a way around these problems? Most particularly, how can I move the variable bindings closer to the parser they bind to?

Upvotes: 6

Views: 340

Answers (3)

Rufflewind

Reputation: 8956

I don't really have a solution to the problem, but perhaps some intuition might help you build applicative parsers more easily. When it comes to Applicative, there are two kinds of "sequencing" that need to be taken into account:

  • The sequencing of the parsing operations: this is what determines the order in which you write the parsers.
  • The sequencing of the underlying values: this one is more flexible, as you can combine them in any order you like.

When the two sequences match each other well, the result is a very nice and compact representation of the parser in applicative notation. For example:

data Infix = Infix     Double     Operator     Double
infix      = Infix <$> number <*> operator <*> number

The problem is that when the two sequences don't match exactly, you have to massage the underlying values to make things work (you can't change the order of the parsers):

number = f <$> sign <*> decimal <*> exponent
  where f sign decimal exponent = sign * decimal * 10 ^^ exponent

Here, in order to compute the number you have to do a slightly nontrivial combination of operations, which is done by the local function f.

Another typical situation is that you need to discard some value:

exponent = oneOf "eE" *> integer

Here, *> discards the value on the left and keeps the value on the right. The <* operator does the opposite, discarding the right and keeping the left. When you have a chain of such operations, you decode them using left-associativity:

p1 *> p2 <* p3 *> p4 <* p5  ≡  (((p1 *> p2) <* p3) *> p4) <* p5

This is artificially contrived: you generally don't want to write code like this. It's better to break the expression up into meaningful pieces (and preferably give them meaningful names). One common pattern that you will see is:

-- discard the result of everything except `p3`
p1 *> p2 *> p3 <* p4 <* p5
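Both shapes can be verified directly; a sketch using the Maybe applicative as a stand-in for the parser type (an assumption for illustration — chains of these operators associate the same way in any Applicative):

```haskell
-- Left-associative chains of *> and <*; Maybe stands in for the parser.
main :: IO ()
main = do
  print (Just 1 *> Just 2 <* Just 3 *> Just 4 <* Just 5)  -- Just 4
  print (Just 1 *> Just 2 *> Just 3 <* Just 4 <* Just 5)  -- Just 3
```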

There is a small caveat, though: if you want to apply something else to p3, or if p3 consists of multiple parts, you will have to use parentheses:

-- applying a pure function
f <$> (p1 *> p2 *> p3 <* p4 <* p5)  ≡  p1 *> p2 *> (f <$> p3) <* p4 <* p5

-- p3 consists of multiple parts
p1 *> p2 *> (p3' <*> p3'') <* p4 <* p5

Again, in these situations it's often better to just break up the expression into meaningful fragments with names.

Applicative notation, in some sense, forces you into dividing up the parsers into logical chunks so that it's easier to read, as opposed to the monadic notation where you could conceivably do everything in one monolithic block.

Upvotes: 4

kosmikus

Reputation: 19637

Well, your example parser is artificially complicated.

There are lots of occurrences of trim that you can abstract from:

token p = p <* trim

You can also abstract from something that occurs between a pair of matching parentheses:

parens p = token (char '(') *> p <* token (char ')')

Now what's left is:

call =
  (\ n as _ r -> Call {name = n, args = as, result = r}) <$>
  name <*>
  parens (sepBy argument (() <$ token (char ','))) <*>
  token (char '=') <*>
  result

Finally, you shouldn't count occurrences of _, but rather learn to use <$ and <*. Here are useful rules of thumb:

  • Only use *> in the combination foo *> p <* bar such as in parens above, nowhere else.

  • Make your parsers have the form f <$> p1 <*> ... <*> pn, and choose between <$> and <$ in the first position (or between <*> and <* in all other positions) purely based on whether you're interested in the result of the corresponding parser. If you are, use the variant with the >; otherwise, use the one without. Then you never need to ignore any arguments in f, because you never get access to them in the first place. In the reduced example above, the only thing left that we're not interested in is the = token, so we can say

    call = Call <$> name
                <*> parens (sepBy argument (() <$ token (char ',')))
                <*  token (char '=')
                <*> result
    

(This is assuming that Call actually takes only these three arguments.) I'd argue that this version is easier to read even than your original do-based version.

To answer your more general question: yes, it's possible to recognize do-blocks that don't need the power of monads. Simply put, they're the ones that are just a sequence of bindings with a return at the very end, where each bound variable is used only in that final return and nowhere else. There is a proposal to add this to GHC. (Personally, I'm not a huge fan of it, though. I think applicative notation is more functional than do-notation.)
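That proposal eventually became GHC's ApplicativeDo extension; a minimal sketch, assuming a GHC new enough to support it (Maybe again stands in for the parser type):

```haskell
{-# LANGUAGE ApplicativeDo #-}
-- With ApplicativeDo, a do-block that is just a sequence of binds
-- followed by a pure/return of the bound variables desugars to
-- Applicative combinators only, so it works even for types with no
-- Monad instance.  Maybe stands in for the parser type here.
pair :: Maybe (Int, Int)
pair = do
  x <- Just 1
  y <- Just 2
  pure (x, y)        -- desugars to (,) <$> Just 1 <*> Just 2

main :: IO ()
main = print pair    -- Just (1,2)
```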

Upvotes: 9

Zeta

Reputation: 105886

Write smaller parsers. For example, your arguments seem to be (argument[, argument…]). That can be easily expressed by

argListP :: Parser [Argument]
argListP = char '(' *> trim *> argument `sepBy` (char ',' *> trim) <* char ')'

which is still quite readable: a '(' followed by whitespace, arguments separated by commas and whitespace, and a trailing ')'. The same can be done for your result:

resultP :: Parser Result
resultP  = trim *> char '=' *> result

As you can see, that is still readable: arbitrary whitespace, followed by an equals sign and some kind of result. Now call is almost trivial:

call :: Parser Call
call = Call <$> name <*> argListP <*> resultP
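To check that this decomposition really composes, here is a minimal, self-contained Applicative-only parser. The Parser type and its helpers (satisfy, char, trim, ident, sepBy) are assumptions invented for this sketch; argListP, resultP and call follow the answer, with result assumed to skip its own leading whitespace:

```haskell
{-# LANGUAGE LambdaCase #-}
-- A minimal Applicative-only parser; no Monad instance is defined.
import Control.Applicative
import Data.Char (isAlphaNum, isSpace)

newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> fmap (\(a, rest) -> (f a, rest)) (p s)

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> do
    (f, s')  <- pf s
    (a, s'') <- pa s'
    Just (f a, s'')

instance Alternative Parser where
  empty = Parser (const Nothing)
  Parser p <|> Parser q = Parser $ \s -> p s <|> q s

satisfy :: (Char -> Bool) -> Parser Char
satisfy ok = Parser $ \case
  (x:xs) | ok x -> Just (x, xs)
  _             -> Nothing

char :: Char -> Parser Char
char c = satisfy (== c)

trim :: Parser ()                 -- skip any amount of whitespace
trim = () <$ many (satisfy isSpace)

ident :: Parser String            -- one or more alphanumeric characters
ident = some (satisfy isAlphaNum)

sepBy :: Parser a -> Parser sep -> Parser [a]
sepBy p sep = ((:) <$> p <*> many (sep *> p)) <|> pure []

name, argument, result :: Parser String
name     = ident
argument = ident
result   = trim *> ident          -- assumed to skip its own leading space

data Call = Call String [String] String deriving (Eq, Show)

argListP :: Parser [String]
argListP = char '(' *> trim *> argument `sepBy` (char ',' *> trim) <* char ')'

resultP :: Parser String
resultP = trim *> char '=' *> result

call :: Parser Call
call = Call <$> name <*> argListP <*> resultP

main :: IO ()
main = print (runParser call "f(a, b) = c")
```

Running it parses "f(a, b) = c" into Call "f" ["a","b"] "c" with no leftover input, confirming that the three named pieces compose exactly as the answer claims.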

Upvotes: 4
