Reputation: 355
I'm working on a Parsec parser to handle a somewhat complex data file format (and I have no control over this format).
I've made a lot of progress, but am currently stuck with the following.
I need to be able to parse a line somewhat like this:
4 0.123 1.452 0.667 * 3.460 149 - -
Semantically, the 4
is a nodeNum, the Floats
and the *
are negative log probabilities (so, *
represents the negative log of probability zero). The 149
and the minus signs are really junk, which I can discard, but I need to at least make sure they don't break the parser.
Here's what I have so far:
This handles the "junk" I mentioned. It could probably be simpler, but it works by itself.
emAnnotationSet = (,,) <$> p_int <*>
(reqSpaces *> char '-') <*>
(reqSpaces *> char '-')
the nodeNum
at the beginning of the line is handled by another parser that works and I need not get into.
The problem is in trying to pick out all the p_logProb
s from the line, without consuming the digits at the beginning of the emAnnotationSet
.
the parser for p_logProb
looks like this:
p_logProb = liftA mkScore (lp <?> "logProb")
where lp = try dub <|> string "*"
dub = (++) <$> ((++) <$> many1 digit <*> string ".") <*> many1 digit
And finally, I try to separate the logProb
entries from the trailing emAnnotationSet
(which starts with an integer) as follows:
hmmMatchEmissions = optSpaces *> (V.fromList <$> sepBy p_logProb reqSpaces)
<* optSpaces <* emAnnotationSet <* eol
<?> "matchEmissions"
So, p_logProb
will only succeed on a float that begins with digits, includes a decimal point, and then has further digits (this restriction is respected by the file format).
I'd hoped that the try
in the p_logProb
definition would avoid consuming the leading digits if it didn't parse the decimal and the rest, but this doesn't seem to work; Parsec still complains that it sees an unexpected space after the digits of that integer in the emAnnotationSet
:
Left "hmmNode" (line 1, column 196):
unexpected " "
expecting logProb
column 196 corresponds to the space after the integer preceding the minus signs, so it's clear to me that the problem is that the p_logProb
parser is consuming the integer. How can I fix this so the p_logProb
parser uses lookahead correctly, thus leaving that input for the emAnnotationSet
parser?
Upvotes: 1
Views: 980
Reputation: 12749
The integer which terminates the probabilities cannot be mistaken for a probability since it doesn't contain a decimal point. The lexeme
combinator converts a parser into one that skips trailing spaces.
import Text.Parsec
import Text.Parsec.String
import Data.Char
import Control.Applicative ( (<$>), (<*>), (<$), (<*), (*>) )
fractional :: Fractional a => Parser a
fractional = try $ do
n <- fromIntegral <$> decimal
char '.'
f <- foldr (\d f -> (f + fromIntegral (digitToInt d))/10.0) 0.0 <$> many1 digit
return $ n + f
decimal :: Parser Int
decimal = foldl (\n d -> 10 * n + digitToInt d) 0 <$> many1 digit
lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')
data Row = Row Int [Maybe Double]
deriving ( Show )
probability :: Fractional a => Parser (Maybe a)
probability = (Just <$> fractional) <|> (Nothing <$ char '*')
junk = lexeme decimal <* count 2 (lexeme $ char '-')
row :: Parser Row
row = Row <$> lexeme decimal <*> many1 (lexeme probability) <* junk
rows :: Parser [Row]
rows = spaces *> sepEndBy row (lexeme newline) <* eof
Usage:
*Main> parseTest rows "4 0.123 1.234 2.345 149 - -\n5 0.123 * 2.345 149 - -"
[Row 4 [Just 0.123,Just 1.234,Just 2.345],Row 5 [Just 0.123,Nothing,Just 2.345]]
Upvotes: 2
Reputation: 14668
I'm not exactly sure of your problem. However, to parse the line given based on your description, it would be much easier to use existing lexers define in Text.Parsec.Token1
, and join them together.
The below code parses the line into a Line data type, you can process it further from there if necessary. Instead of attempting to filter out the -
and integers before parsing, it uses a parseEntry
parser that returns a Just Double
if it is a Float value, Just 0
for *
, and Nothing
for integers and dashes. This is then very simply filtered using catMaybes
.
Here is the code:
module Test where
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellDef)
import Control.Applicative ((<$>))
import Data.Maybe (catMaybes)
lexer = P.makeTokenParser haskellDef
parseFloat = P.float lexer
parseInteger = P.natural lexer
whiteSpace = P.whiteSpace lexer
parseEntry = try (Just <$> parseFloat)
<|> try (const (Just 0) <$> (char '*' >> whiteSpace))
<|> try (const Nothing <$> (char '-' >> whiteSpace))
<|> (const Nothing <$> parseInteger)
data Line = Line {
lineNodeNum :: Integer
, negativeLogProbabilities :: [Double]
} deriving (Show)
parseLine = do
nodeNum <- parseInteger
whiteSpace
probabilities <- catMaybes <$> many1 parseEntry
return $ Line { lineNodeNum = nodeNum, negativeLogProbabilities = probabilities }
Example usage:
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46]}
The only issue that may (or may not) be a problem is it will parse *-
as two different tokens, rather than fail at parsing. Eg
*Test> parseTest parseLine "4 0.123 1.452 0.667 * 3.460 149 - -*"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46,0.0]}
Note the extra 0.0
at the end of the log probabilities.
Upvotes: 1