Noah Daniels
Noah Daniels

Reputation: 355

Parsec lookahead to handle ints

I'm working on a Parsec parser to handle a somewhat complex data file format (and I have no control over this format).

I've made a lot of progress, but am currently stuck with the following.

I need to be able to parse a line somewhat like this:

4  0.123  1.452  0.667  *  3.460  149 - -

Semantically, the 4 is a nodeNum, the Floats and the * are negative log probabilities (so, * represents the negative log of probability zero). The 149 and the minus signs are really junk, which I can discard, but I need to at least make sure they don't break the parser.

Here's what I have so far:

This handles the "junk" I mentioned. It could probably be simpler, but it works by itself.

 emAnnotationSet = (,,) <$> p_int  <*>
                           (reqSpaces *> char '-') <*>
                           (reqSpaces *> char '-')

the nodeNum at the beginning of the line is handled by another parser that works and I need not get into.

The problem is in trying to pick out all the p_logProbs from the line, without consuming the digits at the beginning of the emAnnotationSet.

the parser for p_logProb looks like this:

p_logProb = liftA mkScore (lp <?> "logProb")
          where lp = try dub <|> string "*"
                dub = (++) <$> ((++) <$> many1 digit <*> string ".") <*> many1 digit

And finally, I try to separate the logProb entries from the trailing emAnnotationSet (which starts with an integer) as follows:

hmmMatchEmissions     = optSpaces *> (V.fromList <$> sepBy p_logProb reqSpaces) 
                      <* optSpaces <* emAnnotationSet <* eol 
                      <?> "matchEmissions"

So, p_logProb will only succeed on a float that begins with digits, includes a decimal point, and then has further digits (this restriction is respected by the file format).

I'd hoped that the try in the p_logProb definition would avoid consuming the leading digits if it didn't parse the decimal and the rest, but this doesn't seem to work; Parsec still complains that it sees an unexpected space after the digits of that integer in the emAnnotationSet:

Left "hmmNode" (line 1, column 196):
unexpected " "
expecting logProb

column 196 corresponds to the space after the integer preceding the minus signs, so it's clear to me that the problem is that the p_logProb parser is consuming the integer. How can I fix this so the p_logProb parser uses lookahead correctly, thus leaving that input for the emAnnotationSet parser?

Upvotes: 1

Views: 980

Answers (2)

pat
pat

Reputation: 12749

The integer which terminates the probabilities cannot be mistaken for a probability since it doesn't contain a decimal point. The lexeme combinator converts a parser into one that skips trailing spaces.

import Text.Parsec
import Text.Parsec.String
import Data.Char
import Control.Applicative ( (<$>), (<*>), (<$), (<*), (*>) )

fractional :: Fractional a => Parser a
fractional = try $ do
  n <- fromIntegral <$> decimal
  char '.'
  f <- foldr (\d f -> (f + fromIntegral (digitToInt d))/10.0) 0.0 <$> many1 digit  
  return $ n + f

decimal :: Parser Int
decimal = foldl (\n d -> 10 * n + digitToInt d) 0 <$> many1 digit

lexeme :: Parser a -> Parser a
lexeme p = p <* skipMany (char ' ')

data Row = Row Int [Maybe Double]
           deriving ( Show )

probability :: Fractional a => Parser (Maybe a)
probability = (Just <$> fractional) <|> (Nothing <$ char '*')

junk = lexeme decimal <* count 2 (lexeme $ char '-')

row :: Parser Row
row = Row <$> lexeme decimal <*> many1 (lexeme probability) <* junk

rows :: Parser [Row]
rows = spaces *> sepEndBy row (lexeme newline) <* eof

Usage:

*Main> parseTest rows "4 0.123 1.234 2.345 149 - -\n5 0.123 * 2.345 149 - -" 
[Row 4 [Just 0.123,Just 1.234,Just 2.345],Row 5 [Just 0.123,Nothing,Just 2.345]]

Upvotes: 2

David Miani
David Miani

Reputation: 14668

I'm not exactly sure of your problem. However, to parse the line given based on your description, it would be much easier to use existing lexers define in Text.Parsec.Token1, and join them together.

The below code parses the line into a Line data type, you can process it further from there if necessary. Instead of attempting to filter out the - and integers before parsing, it uses a parseEntry parser that returns a Just Double if it is a Float value, Just 0 for *, and Nothing for integers and dashes. This is then very simply filtered using catMaybes.

Here is the code:

module Test where
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellDef)
import Control.Applicative ((<$>))
import Data.Maybe (catMaybes)
lexer = P.makeTokenParser haskellDef

parseFloat = P.float lexer
parseInteger = P.natural lexer 

whiteSpace = P.whiteSpace lexer

parseEntry = try (Just <$> parseFloat)
             <|> try (const (Just 0) <$> (char '*' >> whiteSpace))
             <|> try (const Nothing <$> (char '-' >> whiteSpace))
             <|> (const Nothing <$> parseInteger)


data Line = Line {
      lineNodeNum :: Integer
    , negativeLogProbabilities :: [Double]
    } deriving (Show)

parseLine = do
  nodeNum <- parseInteger
  whiteSpace
  probabilities <- catMaybes <$> many1 parseEntry
  return $ Line { lineNodeNum = nodeNum, negativeLogProbabilities = probabilities }

Example usage:

*Test> parseTest parseLine "4  0.123  1.452  0.667  *  3.460  149 - -"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46]}

The only issue that may (or may not) be a problem is it will parse *- as two different tokens, rather than fail at parsing. Eg

*Test> parseTest parseLine "4  0.123  1.452  0.667  *  3.460  149 - -*"
Line {lineNodeNum = 4, negativeLogProbabilities = [0.123,1.452,0.667,0.0,3.46,0.0]}

Note the extra 0.0 at the end of the log probabilities.

Upvotes: 1

Related Questions