How do I turn this regex into a Megaparsec parser without making a mess?

Consider this regex:

^foo/[^=]+/baz=(.*),[^,]*$

If I run it on foo/bar/baz=one,two, it matches and the subgroup captures one. If I run it on foo/bar/baz/bar/baz=three,four,five, it matches and the subgroup captures three,four.

I know how to turn this into a regex-applicative parser or a ReadP parser:

import Text.Regex.Applicative
match (string "foo/" *> some (psym (/= '=')) *> string "/baz=" *> many anySym <* sym ',' <* many (psym (/= ','))) <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Just "one",Just "three,four"]
import Text.ParserCombinators.ReadP
readP_to_S (string "foo/" *> many1 (satisfy (/= '=')) *> string "/baz=" *> many get <* char ',' <* many (satisfy (/= ',')) <* eof) <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [[("one","")],[("three,four","")]]

And both of those work just the way I want them to. But when I try to transliterate that directly into Megaparsec, it goes badly:

import Text.Megaparsec
parse (chunk "foo/" *> some (anySingleBut '=') *> chunk "/baz=" *> many anySingle <* single ',' <* many (anySingleBut ',') <* eof) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Left (ParseErrorBundle {bundleErrors = TrivialError 11 (Just (Tokens ('=' :| "one,"))) (fromList [Tokens ('/' :| "baz=")]) :| [], bundlePosState = PosState {pstateInput = "foo/bar/baz=one,two", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}}),Left (ParseErrorBundle {bundleErrors = TrivialError 19 (Just (Tokens ('=' :| "thre"))) (fromList [Tokens ('/' :| "baz=")]) :| [], bundlePosState = PosState {pstateInput = "foo/bar/baz/bar/baz=three,four,five", pstateOffset = 0, pstateSourcePos = SourcePos {sourceName = "", sourceLine = Pos 1, sourceColumn = Pos 1}, pstateTabWidth = Pos 8, pstateLinePrefix = ""}})]

I know this stems from Megaparsec not backtracking by default. I tried to fix this by just sticking try in a bunch of different places, but I couldn't get that to work. Eventually, I got this monstrosity with notFollowedBy to work:

import Text.Megaparsec
parse (chunk "foo/" *> some (noneOf "=/" <|> try (single '/' <* notFollowedBy (chunk "baz="))) *> chunk "/baz=" *> many (try (anySingle <* notFollowedBy (many (anySingleBut ',') <* eof))) <* single ',' <* many (anySingleBut ',') <* eof) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Right "one",Right "three,four"]

But that looks like a mess! In particular, I don't like that I effectively had to specify much of the pattern twice. And technically, wouldn't that be equivalent to the regex ^foo/(?:[^=/]|/(?!baz=))+/baz=((?:.(?![^,]*$))*),[^,]*$, rather than my initial regex? There's got to be a better way to write that parser. How do I do it?


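Edit: I also tried it this way. It gives the right results on my two test inputs, but it is not actually correct: it incorrectly accepts the input foo//baz=, (with the trailing comma), which the original regex rejects: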

import Text.Megaparsec
parse (chunk "foo/" *> (some . try $ many (noneOf "=/") *> single '/') *> chunk "baz=" *> ((++) <$> many (anySingleBut ',') <*> (concat <$> manyTill ((:) <$> single ',' <*> many (anySingleBut ',')) (try $ single ',' *> many (anySingleBut ',') *> eof)))) "" <$> ["foo/bar/baz=one,two", "foo/bar/baz/bar/baz=three,four,five"]
-- [Right "one",Right "three,four"]

It seems just as messy, though, and manyTill means it doesn't really map onto any regex anymore.

Upvotes: 3

Views: 556

Answers (1)

Daniel Wagner

Reputation: 152682

Without reading carefully, I guess the bit that's giving you trouble is this part:

(.*),[^,]*

If so, then consider using

sepBy (many (noneOf ",")) (string ",")

which will parse a list of comma-separated things. Then, in pure code afterwards, drop the last element of that list and rejoin the remaining elements with commas (e.g. with a well-placed fmap).
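As a minimal sketch of that idea, assuming Megaparsec over String input with Void as the custom error type: the names Parser and captureBeforeLastComma below are made up for illustration, and the check for at least two fields is an added assumption, there because the regex requires at least one comma.

```haskell
import Data.List (intercalate)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

-- Hypothetical parser type for this sketch: String input, no custom errors.
type Parser = Parsec Void String

-- Rough equivalent of (.*),[^,]*$ that returns the captured group.
captureBeforeLastComma :: Parser String
captureBeforeLastComma = do
  -- Split the remaining input on commas; fields may be empty strings.
  fields <- sepBy (many (noneOf ",")) (string ",") <* eof
  case fields of
    -- Two or more fields means at least one comma was present:
    -- drop the last field and rejoin the rest with commas.
    _ : _ : _ -> pure (intercalate "," (init fields))
    -- Zero or one field means there was no comma to split on.
    _         -> fail "expected at least one comma"
```

With these definitions, parseMaybe captureBeforeLastComma "three,four,five" should give Just "three,four", and an input without any comma should give Nothing. No try is needed here because each field parser stops at the next comma on its own, so nothing ever has to be un-consumed.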

From the comments, it seems you are also having some trouble with this part:

/[^=]+/baz=

You could consider something like this as a translation for that:

slashPath = string "/" <++> path
path = string "baz=" <|> (many (noneOf "=/") <++> slashPath)
(<++>) = liftA2 (++)
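Here is the same fragment with imports and type signatures filled in, as a sketch that assumes Megaparsec over String input with Void as the custom error type. The names Parser and prefix are hypothetical additions for illustration. It relies on string backtracking automatically in Megaparsec, which is why the alternative in path works without try.

```haskell
import Control.Applicative (liftA2)
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (string)

-- Hypothetical parser type for this sketch: String input, no custom errors.
type Parser = Parsec Void String

-- Run two string parsers in sequence and concatenate their results.
(<++>) :: Parser String -> Parser String -> Parser String
(<++>) = liftA2 (++)

-- A slash followed by either the final "baz=" or another path segment.
slashPath :: Parser String
slashPath = string "/" <++> path

-- Either the terminating "baz=", or one segment without '=' or '/'
-- followed by the rest of the path.
path :: Parser String
path = string "baz=" <|> (many (noneOf "=/") <++> slashPath)

-- Hypothetical wrapper: the literal "foo" and then the path up to "baz=".
prefix :: Parser String
prefix = string "foo" *> slashPath
```

For example, parseMaybe (prefix *> takeRest) "foo/bar/baz=one,two" should give Just "one,two", leaving the comma-separated tail for the sepBy suggestion above. As written, this sketch is slightly more permissive than the regex: it also accepts foo/baz=one,two, which the regex rejects because [^=]+ needs at least one character between the slashes.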

Upvotes: 2
