Reputation: 1769
I'm a newbie to Haskell, and now I'm learning to use parsec. I get stuck in one problem, that is, I want to get all the sub-strings which satisfies some specific pattern in a string. For example, from the following string,
"I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',"-"]."
the result should be [F12, F12, F1(a), F2a, F5-A, F34-5], where the space between the F and the digit should be deleted.
With the parsec, I have succeeded in getting one sub-string, such as F12, F2a. The code is as follows:
hao :: Parser Char
hao = oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
tuhao :: Parser String
tuhao = do { c <- char 'F'
; many space
; c1 <- digit
; cs <- many hao
; return (c:c1:cs)
}
parse tuhao "" str -- can parse the str and get one sub-string.
However, I am stuck at how to parse the example string above and get all the sub-strings of the specific pattern. I have an idea that if F is found, then begin parsing, else skip parsing or if parsing fails then skip parsing. But I don't know how to implement the plan. I have another idea that uses State
to record the remaining string that is not parsed, and use recursion, but still fail to carry it out.
So I appreciate any tip! ^_^
Upvotes: 4
Views: 703
Reputation: 3426
We can get sub-strings of specific pattern in a string with the
findAll
combinator from
replace-megaparsec.
Notice that this tuhao
parser doesn't actually return anything. The findAll
combinator just checks for success of the parser to find sub-strings which match the pattern.
import Replace.Megaparsec
import Text.Megaparsec
import Text.Megaparsec.Char
import Data.Maybe
import Data.Either
let tuhao :: Parsec Void String ()
tuhao = do
void $ single 'F'
void $ space
void $ digitChar
void $ many $ oneOf "abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ()-"
input = "I want to choose F12 or F 12 from F1(a), F2a, F5-A, F34-5 and so on, but F alone should not be chosen, that is, choose those which start with F followed by a digit (before the digit there could be zero or more than one space) and then by any character from ['a'..'z'] ++ ['A'..'Z'] ++ ['0'..'9'] ++ ['(',')',\"-\"]."
rights $ fromJust $ parseMaybe (findAll tuhao) input
["F12","F 12","F1(a)","F2a","F5-A","F34-5"]
Upvotes: 1
Reputation: 16105
F12
,F 12
,F1(a)
,F2a
,F5-A
,F34-5
This is an incomplete description, so I'll make some guesses.
I would start by defining a type that can contain the logical parts of these expressions. E.g.
newtype F = F (Int, Maybe String) deriving Show
That is, "F" followed by a number and an optional part that is either letters, parenthesised letters, or a dash followed by letters/digits. Since the number after "F" can have multiple digits, I assume that the optional letters/digits may be multiple, too.
Since the examples are limited, I assume that the following aren't valid: F1a(b)
, F1(a)b
, F1a-5
, F1(a)-A
, F1a(a)-5
, F1a1
, F1-(a)
, etc. and that the following are valid: F1A
, F1abc
, F1(abc)
, F1-abc
, F1-a1b2
. This is probably not true. [1]
I would then proceed to write parsers for each of these sub-parts and compose them:
module Main where
import Text.Parsec
import Data.Maybe (catMaybes)
symbol :: String -> Parser String
symbol s = string s <* spaces
parens :: Parser a -> Parser a
parens = between (string "(") (string ")")
digits :: Parser Int
digits = read <$> many1 digit
parseF :: Parser F
parseF = curry F <$> firstPart <*> secondPart
where
firstPart :: Parser Int
firstPart = symbol "F" >> digits
secondPart :: Parser (Maybe String)
secondPart = optionMaybe $ choice
[ many1 letter
, parens (many1 letter)
, string "-" >> many1 alphaNum
]
(As Jon Purdy writes in a comment,) using this parser on a string to get multiple matches,
extract :: Parser a -> Parser [a]
extract p = do (:) <$> try p <*> extract p
<|> do anyChar >> extract p
<|> do eof >> return []
readFs :: String -> Either ParseError [F]
readFs s = parse (extract parseF) "" s
main :: IO ()
main = print (readFs "F12, F 12, F1(a), F2a, F5-A, F34-5")
This prints:
Right [F (12,Nothing),F (12,Nothing),F (1,Just "a"),F (2,Just "a"),F (5,Just "A"),F (34,Just "5")]
Takeaways:
You can parse optional whitespace using token parsing (symbol
).
You can parse optional parts with option
, optionMaybe
or optional
.
You can alternate between combinators using a <|> b <|> c
or choice [a, b, c]
.
When alternating between choices, make sure they don't have overlapping FIRST sets. Otherwise you need to try
; this is nasty but sometimes unavoidable. (In this case, FIRST sets for the three choices are letter
, string "("
and string "-"
, i.e. not overlapping.)
[1]: For the sake of restriction, I kept to the assumptions above, but I felt that I could also have assumed that F1a-B
, F1(a)-5
and F1(a)-5A
are valid, in which case I might change the model to:
newtype F = F (Int, Maybe String, Maybe String)
Upvotes: 2