Ben Collins

Reputation: 20686

With FParsec, how does one use the manyCharsTill and between parsers and not fail on the closing string?

I'm trying to use FParsec to parse a TOML multi-line string, and I'm having trouble with the closing delimiter ("""). I have the following parsers:

let controlChars = 
    ['\u0000'; '\u0001'; '\u0002'; '\u0003'; '\u0004'; '\u0005'; '\u0006'; '\u0007';
     '\u0008'; '\u0009'; '\u000a'; '\u000b'; '\u000c'; '\u000d'; '\u000e'; '\u000f';
     '\u0010'; '\u0011'; '\u0012'; '\u0013'; '\u0014'; '\u0015'; '\u0016'; '\u0017';
     '\u0018'; '\u0019'; '\u001a'; '\u001b'; '\u001c'; '\u001d'; '\u001e'; '\u001f';
     '\u007f']

let nonSpaceCtrlChars =
    Set.difference (Set.ofList controlChars) (Set.ofList ['\n';'\r';'\t'])

let multiLineStringContents : Parser<char,unit> =
    satisfy (isNoneOf nonSpaceCtrlChars)

let multiLineString         : Parser<string,unit> =
    optional newline >>. manyCharsTill multiLineStringContents (pstring "\"\"\"")
    |> between (pstring "\"\"\"") (pstring "\"\"\"") 

let test parser str =
    match run parser str with
    | Success (s1, s2, s3) -> printfn "Ok: %A %A %A" s1 s2 s3
    | Failure (f1, f2, f3) -> printfn "Fail: %A %A %A" f1 f2 f3

When I test multiLineString against an input like this:

test multiLineString "\"\"\"x\"\"\""

The parser fails with this error:

Fail: "Error in Ln: 1 Col: 8
"""x"""
       ^
Note: The error occurred at the end of the input stream.
Expecting: '"""'

I'm confused by this. Wouldn't the manyCharsTill multiLineStringContents (pstring "\"\"\"") parser stop at the """ for the between parser to find it? Why is the parser eating all the input and then failing the between parser?

This seems like a relevant post: How to parse comments with FParsec

But I don't see how the solution to that one differs from what I'm doing here, really.

Upvotes: 2

Views: 572

Answers (2)

Ben Collins

Reputation: 20686

@rmunn provided a correct answer, thanks! I also solved this in a slightly different way after playing with the FParsec API a bit more. As explained in the other answer, the endp argument to manyCharsTill was eating the closing """, so I needed to switch to something that wouldn't consume it. A simple modification using lookAhead did the trick:

let multiLineString         : Parser<string,unit> =
    optional newline >>. manyCharsTill multiLineStringContents (lookAhead (pstring "\"\"\""))
    |> between (pstring "\"\"\"") (pstring "\"\"\"") 
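
To see the difference side by side, here is a minimal sketch, assuming an F# script with FParsec pulled from NuGet. The names tripleQuote, consuming, peeking and succeeds are made up for illustration, and anyChar stands in for multiLineStringContents to keep the script short.

```fsharp
#r "nuget: FParsec"
open FParsec

// Illustrative name, not from the original post.
let tripleQuote : Parser<string, unit> = pstring "\"\"\""

// Original shape: endp consumes the closing delimiter, so `between`
// finds nothing left to match as its closing parser.
let consuming : Parser<string, unit> =
    manyCharsTill anyChar tripleQuote |> between tripleQuote tripleQuote

// lookAhead restores the stream position after endp matches, so the
// closing delimiter is still there for `between`.
let peeking : Parser<string, unit> =
    manyCharsTill anyChar (lookAhead tripleQuote) |> between tripleQuote tripleQuote

// True when the parser succeeds on the input, false otherwise.
let succeeds p input =
    match run p input with
    | Success _ -> true
    | Failure _ -> false

printfn "consuming: %b" (succeeds consuming "\"\"\"x\"\"\"")
printfn "peeking:   %b" (succeeds peeking "\"\"\"x\"\"\"")
```

The first line should report false and the second true, which matches the error in the question and the fix above.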

Upvotes: 3

rmunn

Reputation: 36718

The manyCharsTill documentation says (note the part about stopping after endp):

manyCharsTill cp endp parses chars with the char parser cp until the parser endp succeeds. It stops after endp and returns the parsed chars as a string.

So you don't want to use between in combination with manyCharsTill; you want to do something like pstring "\"\"\"" >>. manyCharsTill multiLineStringContents (pstring "\"\"\"").
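
Spelled out as a runnable script, that idea might look like the sketch below. It assumes FParsec is loaded from NuGet; tripleQuote, simpleMultiLine and tryParse are names invented here, and anyChar is a stand-in for the question's multiLineStringContents.

```fsharp
#r "nuget: FParsec"
open FParsec

// Illustrative name, not from the original post.
let tripleQuote : Parser<string, unit> = pstring "\"\"\""

// manyCharsTill consumes the closing delimiter itself, so no `between` is needed.
// The optional newline mirrors the question's handling of a leading line break.
let simpleMultiLine : Parser<string, unit> =
    tripleQuote >>. optional newline >>. manyCharsTill anyChar tripleQuote

// Some parsed text on success, None on failure.
let tryParse p input =
    match run p input with
    | Success (result, _, _) -> Some result
    | Failure _ -> None

printfn "%A" (tryParse simpleMultiLine "\"\"\"x\"\"\"")
```

With the input from the question this should print Some "x", since the closing """ is consumed by manyCharsTill rather than by a separate closing parser.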

But as it happens, I can save you a lot of work. I've been working on a TOML parser with FParsec myself in my spare time. It's far from complete, but the string part works and handles backslash escapes correctly (as far as I can tell: I've tested thoroughly but not exhaustively). The only thing I'm missing is the "strip first newline if it appears right after the opening delimiter" rule, which you've handled with optional newline. So just add that bit into my code below and you should have a working TOML string parser.

BTW, I am planning to license my code (if I finish it) under the MIT license. So I hereby release the following code block under the MIT license. Feel free to use it in your project if it's useful to you.
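
One caveat for anyone pasting the code below: it calls exactly, which is not an FParsec function and is not defined in this answer. The following is only a guess at what such a helper could look like, built on FParsec's manyMinMaxSatisfy; the real definition may differ.

```fsharp
#r "nuget: FParsec"
open FParsec

// Assumption: a stand-in for the undefined `exactly` helper.
// Parses exactly n chars satisfying pred and returns them as a string.
let exactly n pred = manyMinMaxSatisfy n n pred

// Quick check with FParsec's built-in isHex predicate.
match run (exactly 4 isHex) "00e9" with
| Success (s, _, _) -> printfn "Ok: %s" s
| Failure (msg, _, _) -> printfn "Fail: %s" msg
```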

let pShortCodepointInHex = // Anything from 0000 to FFFF, *except* the range D800-DFFF
    (anyOf "dD" >>. (anyOf "01234567" <?> "a Unicode scalar value (range D800-DFFF not allowed)") .>>. exactly 2 isHex |>> fun (c,s) -> sprintf "d%c%s" c s)
    <|> (exactly 4 isHex <?> "a Unicode scalar value")

let pLongCodepointInHex = // Anything from 00000000 to 0010FFFF, *except* the range D800-DFFF
        (pstring "0000" >>. pShortCodepointInHex)
        <|> (pstring "000"  >>. exactly 5 isHex)
        <|> (pstring "0010" >>. exactly 4 isHex |>> fun s -> "0010" + s)
        <?> "a Unicode scalar value (i.e., in range 00000000 to 0010FFFF)"

let toCharOrSurrogatePair p =
    p |> withSkippedString (fun codePoint _ -> System.Int32.Parse(codePoint, System.Globalization.NumberStyles.HexNumber) |> System.Char.ConvertFromUtf32)

let pStandardBackslashEscape =
    anyOf "\\\"bfnrt"
    |>> function
        | 'b' -> "\b"      // U+0008 BACKSPACE
        | 'f' -> "\u000c"  // U+000C FORM FEED
        | 'n' -> "\n"      // U+000A LINE FEED
        | 'r' -> "\r"      // U+000D CARRIAGE RETURN
        | 't' -> "\t"      // U+0009 CHARACTER TABULATION a.k.a. Tab or Horizontal Tab
        | c   -> string c

let pUnicodeEscape =     (pchar 'u' >>. (pShortCodepointInHex |> toCharOrSurrogatePair))
                     <|> (pchar 'U' >>. ( pLongCodepointInHex |> toCharOrSurrogatePair))

let pEscapedChar = pstring "\\" >>. (pStandardBackslashEscape <|> pUnicodeEscape)

let quote = pchar '"'
let isBasicStrChar c = c <> '\\' && c <> '"' && c > '\u001f' && c <> '\u007f'
let pBasicStrChars = manySatisfy isBasicStrChar
let pBasicStr = stringsSepBy pBasicStrChars pEscapedChar |> between quote quote

let pEscapedNewline = skipChar '\\' .>> skipNewline .>> spaces
let isMultilineStrChar c = c = '\n' || isBasicStrChar c
let pMultilineStrChars = manySatisfy isMultilineStrChar


let pTripleQuote = pstring "\"\"\""

let pMultilineStr = stringsSepBy pMultilineStrChars (pEscapedChar <|> (notFollowedByString "\"\"\"" >>. pstring "\"")) |> between pTripleQuote pTripleQuote
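
For the "strip first newline" rule mentioned earlier, one possible way to add it is to fold optional newline into the opening delimiter. The sketch below is an assumption about how that could be wired up, not tested against the full parser above: tripleQuoted and demoBody are hypothetical names, and demoBody is a simplified stand-in for the stringsSepBy expression in pMultilineStr.

```fsharp
#r "nuget: FParsec"
open FParsec

let pTripleQuote : Parser<string, unit> = pstring "\"\"\""

// Hypothetical wrapper: skips one newline right after the opening delimiter,
// then runs the body between the two delimiters.
let tripleQuoted (body : Parser<string, unit>) : Parser<string, unit> =
    body |> between (pTripleQuote >>. optional newline) pTripleQuote

// Stand-in body for the demo only; it stops at the first double quote.
let demoBody : Parser<string, unit> = manySatisfy (fun c -> c <> '"')

match run (tripleQuoted demoBody) "\"\"\"\nhello\"\"\"" with
| Success (s, _, _) -> printfn "Ok: %A" s
| Failure (msg, _, _) -> printfn "Fail: %s" msg
```

In the real parser the body passed to tripleQuoted would be the stringsSepBy pMultilineStrChars (...) expression rather than demoBody.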

Upvotes: 5
