Parsec lookahead to handle ints

Question

I'm working on a Parsec parser to handle a somewhat complex data file format (and I have no control over this format).

I've made a lot of progress, but am currently stuck with the following.

I need to be able to parse a line somewhat like this:

4  0.123  1.452  0.667  *  3.460  149 - -

Semantically, the 4 is a nodeNum, the Floats and the * are negative log probabilities (so, * represents the negative log of probability zero). The 149 and the minus signs are really junk, which I can discard, but I need to at least make sure they don't break the parser.

Here's what I have so far:

This handles the "junk" I mentioned. It could probably be simpler, but it works by itself.

 emAnnotationSet = (,,) <$> p_int  <*>
                           (reqSpaces *> char '-') <*>
                           (reqSpaces *> char '-')

The nodeNum at the beginning of the line is handled by another parser that works, so I won't go into it here.

The problem is in trying to pick out all the p_logProbs from the line, without consuming the digits at the beginning of the emAnnotationSet.

The parser for p_logProb looks like this:

p_logProb = liftA mkScore (lp <?> "logProb")
          where lp = try dub <|> string "*"
                dub = (++) <$> ((++) <$> many1 digit <*> string ".") <*> many1 digit

And finally, I try to separate the logProb entries from the trailing emAnnotationSet (which starts with an integer) as follows:

hmmMatchEmissions     = optSpaces *> (V.fromList <$> sepBy p_logProb reqSpaces)
                      <* optSpaces <* emAnnotationSet <* eol
                      <?> "matchEmissions"

So, p_logProb will only succeed on a float that begins with digits, includes a decimal point, and then has further digits (this restriction is respected by the file format).

I'd hoped that the try in the p_logProb definition would avoid consuming the leading digits if it didn't parse the decimal and the rest, but this doesn't seem to work; Parsec still complains that it sees an unexpected space after the digits of that integer in the emAnnotationSet:

Left "hmmNode" (line 1, column 196):
unexpected " "
expecting logProb

Column 196 corresponds to the space after the integer preceding the minus signs, so it's clear to me that the problem is that the p_logProb parser is consuming the integer. How can I fix this so the p_logProb parser uses lookahead correctly, thus leaving that input for the emAnnotationSet parser?

pat · Accepted Answer

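The integer that terminates the probabilities cannot be mistaken for a probability, since it doesn't contain a decimal point. In the code below, fractional is wrapped in try, so on that integer it fails without consuming any input; many1 then stops cleanly and junk gets to parse the integer. The lexeme combinator converts a parser into one that skips trailing spaces, so no separate separator parser is needed between the values.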

import Text.Parsec
import Text.Parsec.String
import Data.Char
import Control.Applicative ( (<$>), (<*>), (<$), (<*), (*>) )

fractional :: Fractional a => Parser a
fractional = try $ do
  n <- fromIntegral <$> decimal
  char '.'
  f <- foldr (\d f -> (f + fromIntegral (digitToInt d))/10.0) 0.0 <$> many1 digit  
  return $ n + f

decimal :: Parser Int
decimal = foldl (\n d -> 10 * n + digitToInt d) 0 <$> many1 digit

lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')

data Row = Row Int [Maybe Double]
           deriving ( Show )

probability :: Fractional a => Parser (Maybe a)
probability = (Just <$> fractional) <|> (Nothing <$ char '*')

junk = lexeme decimal <* count 2 (lexeme $ char '-')

row :: Parser Row
row = Row <$> lexeme decimal <*> many1 (lexeme probability) <* junk

rows :: Parser [Row]
rows = spaces *> sepEndBy row (lexeme newline) <* eof
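
The two folds in decimal and fractional are a bit terse, so here is a small sketch that pulls them out as pure functions over a digit string. The names intPart and fracPart are only for illustration and are not part of the parser above.

```haskell
import Data.Char (digitToInt)

-- Integer part, as in `decimal`: a left fold, so "149" is built up as
-- ((0*10 + 1)*10 + 4)*10 + 9.
intPart :: String -> Int
intPart = foldl (\n d -> 10 * n + digitToInt d) 0

-- Fractional part, as in `fractional`: a right fold, so "123" is built up
-- from the last digit as (1 + (2 + (3 + 0)/10)/10)/10.
fracPart :: String -> Double
fracPart = foldr (\d f -> (f + fromIntegral (digitToInt d)) / 10) 0
```

In GHCi, intPart "149" gives 149, and fracPart "123" gives a value equal to 0.123 up to floating-point rounding.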

Usage:

*Main> parseTest rows "4 0.123 1.234 2.345 149 - -\n5 0.123 * 2.345 149 - -"
[Row 4 [Just 0.123,Just 1.234,Just 2.345],Row 5 [Just 0.123,Nothing,Just 2.345]]
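
As for why the version in the question fails even with try: sepBy p sep runs sep and then p for every further element. Once reqSpaces has consumed the spaces in front of 149, the failure of p_logProb counts as a failure after input was consumed, and the try inside p_logProb only rewinds the digits, not the separator. The sketch below contrasts the three shapes. The parsers here are simplified stand-ins (they return Strings, and a bare many1 digit plays the role of emAnnotationSet), so treat it as an illustration rather than a drop-in fix.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Simplified stand-ins for the question's parsers (assumed shapes).
reqSpaces :: Parser ()
reqSpaces = skipMany1 (char ' ')

logProb :: Parser String
logProb = try dub <|> string "*"
  where dub = (\a b c -> a ++ b ++ c) <$> many1 digit <*> string "." <*> many1 digit

-- Question's shape: the separator is consumed before logProb is tried,
-- so a failing logProb makes the whole sepBy fail.
withSepBy :: Parser [String]
withSepBy = sepBy logProb reqSpaces <* many1 digit

-- Answer's shape: every value eats its own trailing spaces, so the
-- failing attempt on the integer consumes nothing at all.
withLexeme :: Parser [String]
withLexeme = many1 (logProb <* skipMany (char ' ')) <* many1 digit

-- Alternative: keep separator-first, but wrap separator and value in try
-- so a failed attempt gives the spaces back.
withTry :: Parser [String]
withTry = (:) <$> logProb <*> many (try (reqSpaces *> logProb))
              <* reqSpaces <* many1 digit
```

With the input "0.1 * 149", parse withSepBy "" fails, while withLexeme and withTry both return Right ["0.1","*"].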
