How to unambiguously trim a string with parser combinators?

My current implementation with Text.ParserCombinators.ReadP:

trim :: ReadP String
trim = do skipSpaces
          content <- some get
          skipSpaces
          eof
          return content

The problem is, ambiguous results frequently pop up when run:

ghci> readP_to_S trim " hello world "
[("hello world",""),("hello world ","")]

I believe I need to specify that the trimmed content will not end with space characters, but I have no idea how to express that apart from this extremely inefficient design:

trim :: ReadP String
trim = do skipSpaces
          content <- many get
          lastOne <- satisfy (not . isSpace)
          skipSpaces
          eof
          return (content ++ [lastOne])

We can have better things, right?

Upvotes: 1

Answers (4)

Daniel Wagner

Reputation: 153172

If the inefficiency you are worried about stems from many get frequently backtracking, you could require each sequence of whitespace to end with a non-whitespace. You can also use the munch family to avoid the backtracking introduced by many (satisfy ...).

trim :: ReadP String
trim = do skipSpaces
          content <- some $ liftA2 (++)
             (munch isSpace)
             (munch1 (not . isSpace))
          skipSpaces
          eof
          return (concat content)

Upvotes: 0

K. A. Buhr

Reputation: 51119

The following parser ought to work:

trim :: ReadP String
trim = skipSpaces *> content
  where
    content = do
      spc <- spaces
      ((\w rest -> spc ++ w ++ rest) <$> word <*> content)
        <++ pure ""
    word = munch1 (not .isSpace)
    spaces = munch isSpace

The idea is that the content parser reads (possibly empty) leading whitespace and then attempts to read a non-empty, non-whitespace word. If such a word is found, the leading whitespace and word are prefixed to additional content. If no word is found, the "leading whitespace" must be trailing whitespace at the end of the input string and is discarded (i.e., pure "").

By itself, content will right-trim spaces off the input. By invoking it after skipSpaces in the definition of trim, we ensure that the result will be left-trimmed as well.

This parser should always return a single alternative [("trimmed string","")], as per the following quick check:

import Test.QuickCheck
import Data.List
import Data.Char
import Text.ParserCombinators.ReadP

trim :: ReadP String
trim = skipSpaces *> content
  where
    content = do
      spc <- spaces
      ((\w rest -> spc ++ w ++ rest) <$> word <*> content)
        <++ pure ""
    word = munch1 (not .isSpace)
    spaces = munch isSpace

test_trim :: String -> Bool
test_trim str = readP_to_S trim str == [(dropWhile isSpace (dropWhileEnd isSpace str), "")]

main = quickCheck test_trim

Upvotes: 0

David Fletcher

Reputation: 2818

You could parse a word at a time, then put them back together.

(This does compress any multiple spaces into single ones, if that matters.)

trim :: ReadP String
trim = skipSpaces *> content <* eof
  where
    content = unwords <$> many token
    token = munch1 (not . isSpace) <* skipSpaces

But a more direct way is to say that the parse is terminated by a component which is zero or more spaces then eof. Then we can use manyTill, which will attempt to match internal spaces as the terminator, but will fail because those spaces aren't followed by eof.

trim :: ReadP String
trim = skipSpaces *> manyTill get (skipSpaces <* eof)

(This is potentially slow if you have long runs of internal spaces, since after each one the parser will go through the rest to see if an eof is at the end.)

Upvotes: 0

willeM_ Van Onsem

Reputation: 477533

Instead of trying to do this through parser combinatorics, why not just parse the content and then trim, like:

import Data.Char(isSpace)
import Data.List(dropWhileEnd)

trim :: ReadP String
trim = do skipSpaces
          content <- many get
          skipSpaces
          eof
          return (trimData content)

trimStr :: String -> String
trimStr = dropWhileEnd isSpace . dropWhile isSpace

The parser can be shortened then to:

trim :: ReadP String
trim = trimStr <$> (many get <* eof)

We thus first retrieve all characters, and then we return the trimmed string, although the eof might not be necessary here.

Upvotes: 0

How to unambiguously trim a string with parser combinators?

Answers (4)

Related Questions