How to use parsec to get sub-strings of specific pattern in a string

Question

I'm a newbie to Haskell, and now I'm learning to use parsec. I get stuck in one problem, that is, I want to get all the sub-strings which satisfies some specific pattern in a string. For example, from the following string,

"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."

the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.

With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:

hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
           ; many space
           ; c1 <- digit
           ; cs <- many hao
           ; return (c:c1:cs)
           }
parse tuhao "" str -- can parse the str and get one sub-string.

However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.

So I appreciate any tip! ^_^

sshine · Accepted Answer

F12, F 12, F1(a), F2a, F5-A, F34-5

This is an incomplete description, so I'll make some guesses.

I would start by defining a type that can contain the logical parts of these expressions. E.g.
```
newtype F = F (Int, Maybe String) deriving Show
```
That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.

Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]

I would then proceed to write parsers for each of these sub-parts and compose them:

module Main where

import Text.Parsec
import Data.Maybe (catMaybes)

symbol :: String -> Parser String
symbol s = string s <* spaces

parens :: Parser a -> Parser a
parens = between (string "(") (string ")")

digits :: Parser Int
digits = read <$> many1 digit

parseF :: Parser F
parseF = curry F <$> firstPart <*> secondPart
  where
    firstPart :: Parser Int
    firstPart = symbol "F" >> digits

    secondPart :: Parser (Maybe String)
    secondPart = optionMaybe $ choice
      [ many1 letter
      , parens (many1 letter)
      , string "-" >> many1 alphaNum
      ]

(As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,

extract :: Parser a -> Parser [a]
extract p = do (:) <$> try p <*> extract p
        <|> do anyChar >> extract p
        <|> do eof >> return []

readFs :: String -> Either ParseError [F]
readFs s = parse (extract parseF) "" s

main :: IO ()
main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")

This prints:

Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]

Takeaways:

You can parse optional whitespace using token parsing (symbol).
You can parse optional parts with option, optionMaybe or optional.
You can alternate between combinators using a <|> b <|> c or choice [a, b, c].
When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter, string "(" and string "-", i.e. not overlapping.)

[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B, F1(a)-5 and F1(a)-5A are valid, in which case I might change the model to:

newtype F = F (Int, Maybe String, Maybe String)

How to use parsec to get sub-strings of specific pattern in a string

Answers (2)

Related Questions