Z-Y.L
Z-Y.L

Reputation: 1769

How to use parsec to get sub-strings of specific pattern in a string

I'm a newbie to Haskell, and now I'm learning to use parsec. I get stuck in one problem, that is, I want to get all the sub-strings which satisfies some specific pattern in a string. For example, from the following string,

"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."

the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.

With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:

hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
           ; many space
           ; c1 <- digit
           ; cs <- many hao
           ; return (c:c1:cs)
           }
parse tuhao "" str -- can parse the str and get one sub-string.

However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.

So I appreciate any tip! ^_^

Upvotes: 4

Views: 703

Answers (2)

James Brock
James Brock

Reputation: 3426

We can get sub-strings of specific pattern in a string with the findAll combinator from replace-megaparsec.

Notice that this tuhao parser doesn't actually return anything. The findAll combinator just checks for success of the parser to find sub-strings which match the pattern.

import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either

let tuhao :: Parsec Void String ()
    tuhao = do
        void $ single 'F'
        void $ space
        void $ digitChar
        void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"

input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."

rights $ fromJust $ parseMaybe (findAll tuhao) input
["F12","F 12","F1(a)","F2a","F5-A","F34-5"]

Upvotes: 1

sshine
sshine

Reputation: 16105

F12, F 12, F1(a), F2a, F5-A, F34-5

This is an incomplete description, so I'll make some guesses.

  1. I would start by defining a type that can contain the logical parts of these expressions. E.g.

    newtype F = F (Int, Maybe String) deriving Show
    

    That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.

    Since the examples are limited, I assume that the following aren't valid: F1a(b), F1(a)b, F1a-5, F1(a)-A, F1a(a)-5, F1a1, F1-(a), etc. and that the following are valid: F1A, F1abc, F1(abc), F1-abc, F1-a1b2. This is probably not true. [1]

  2. I would then proceed to write parsers for each of these sub-parts and compose them:

    module Main where
    
    import Text.Parsec
    import Data.Maybe (catMaybes)
    
    symbol :: String -> Parser String
    symbol s = string s <* spaces
    
    parens :: Parser a -> Parser a
    parens = between (string "(") (string ")")
    
    digits :: Parser Int
    digits = read <$> many1 digit
    
    parseF :: Parser F
    parseF = curry F <$> firstPart <*> secondPart
      where
        firstPart :: Parser Int
        firstPart = symbol "F" >> digits
    
        secondPart :: Parser (Maybe String)
        secondPart = optionMaybe $ choice
          [ many1 letter
          , parens (many1 letter)
          , string "-" >> many1 alphaNum
          ]
    
  3. (As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,

    extract :: Parser a -> Parser [a]
    extract p = do (:) <$> try p <*> extract p
            <|> do anyChar >> extract p
            <|> do eof >> return []
    
    readFs :: String -> Either ParseError [F]
    readFs s = parse (extract parseF) "" s
    
    main :: IO ()
    main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")
    

    This prints:

    Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]
    

Takeaways:

  • You can parse optional whitespace using token parsing (symbol).

  • You can parse optional parts with option, optionMaybe or optional.

  • You can alternate between combinators using a <|> b <|> c or choice [a, b, c].

  • When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter, string "(" and string "-", i.e. not overlapping.)

[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B, F1(a)-5 and F1(a)-5A are valid, in which case I might change the model to:

newtype F = F (Int, Maybe String, Maybe String)

Upvotes: 2

Related Questions