Parsing an ML-like syntax based on indentation, and everything considered to be an instruction/expression

Question

NOTE: Not long ago, I had already asked a similar question. This is not a duplication, but the clarifications to be requested did not fall within the scope of the subject itself. I therefore allow myself to open another position dealing with the analysis of an ML-like syntax based on indentation, and considering everything as an instruction / expression.

For example: "Hello" is an expression, let foo = 2 + 1 is an instruction using an expression (2 + 1), print foo is an instruction, ...

In short, a syntax and semantics that is quite modular and dynamic. Like F#, or OCaml.

To do this, I use F#, with the API (available on nuget) FParsec. The FParsec wiki provides an example of a syntax based on indentation, so I have taken it up again. The module in the code below used is IndentationParserWithoutBacktracking.

The example code to be parsed uses an elementary indentation, not mixing "literal" and "instructions/expressions":

loop i 1 10
  loop k 1 10
    print k
  print i
print j

A simple code, and without context (but this is not important at the moment).

My implementation allows codes such as:

let foo = a + b

let foo =
    let a = 9
    let b = 1
    a + b

let foo = 7

let foo =
    loop i 1 10
        print i

For example. (The loop and print are there just for the tests...)

The problem I have been having for a long week now, and that I can't solve, is the fact that the indentation module asks me every time an instruction is expected in a parser for a new line... Here is a screenshot:

This applies to all the examples mentioned above. I don't really understand the problem, and therefore don't know how to solve it.

Here is the code tested for this question, it meets the minimum and functional code criteria, however, FParsec must be used:

open FParsec

// This module come from 'https://github.com/stephan-tolksdorf/fparsec/wiki/Parsing-indentation-based-syntax-with-FParsec'
// I used the second module: 'IndentationParserWithoutBacktracking'

module IndentationParserWithoutBacktracking =

    let tabStopDistance = 8

    type LastParsedIndentation() =
        []
        val mutable Value: int32
        []
        val mutable EndIndex: int64

    type UserState = 
        {Indentation: int
         // We put LastParsedIndentation into the UserState so that we 
         // can conveniently use a separate instance for each stream.
         // The members of the LastParsedIndentation instance will be mutated
         // directly and hence won't be affected by any stream backtracking. 
         LastParsedIndentation: LastParsedIndentation}
        with
           static member Create() = {Indentation = -1
                                     LastParsedIndentation = LastParsedIndentation(EndIndex = -1L)}

    type CharStream = CharStream
    type Parser<'t> = Parser<'t, UserState>

    // If this function is called at the same index in the stream
    // where the function previously stopped, then the previously
    // returned indentation will be returned again. 
    // This way we can avoid backtracking at the end of indented blocks.
    let skipIndentation (stream: CharStream) =    
        let lastParsedIndentation = stream.UserState.LastParsedIndentation
        if lastParsedIndentation.EndIndex = stream.Index then
            lastParsedIndentation.Value
        else
            let mutable indentation = stream.SkipNewlineThenWhitespace(tabStopDistance, false)
            lastParsedIndentation.EndIndex <- stream.Index
            lastParsedIndentation.Value <- indentation
            indentation

    let indentedMany1 (p: Parser<'t>) label : Parser<'t list> =
        fun stream ->
            let oldIndentation = stream.UserState.Indentation
            let indentation = skipIndentation stream
            if indentation <= oldIndentation then 
                Reply(Error, expected (if indentation < 0 then "newline" else "indented " + label))
            else
                stream.UserState <- {stream.UserState with Indentation = indentation}            
                let results = ResizeArray()
                let mutable stateTag = stream.StateTag
                let mutable reply = p stream // parse the first element
                let mutable newIndentation = 0
                while reply.Status = Ok 
                      && (results.Add(reply.Result)
                          newIndentation <- skipIndentation stream
                          newIndentation = indentation)
                   do
                     stateTag <- stream.StateTag
                     reply <- p stream
                if reply.Status = Ok 
                   || (stream.IsEndOfStream && results.Count > 0 && stream.StateTag = stateTag) 
                then
                    if newIndentation < indentation || stream.IsEndOfStream then
                        stream.UserState <- {stream.UserState with Indentation = oldIndentation}
                        Reply(List.ofSeq results)
                    else
                        Reply(Error, messageError "wrong indentation")
                else // p failed
                    Reply(reply.Status, reply.Error) 

open IndentationParserWithoutBacktracking

let isBlank = fun c -> c = ' ' || c = '	'
let ws  = spaces
let ws1 = skipMany1SatisfyL isBlank "whitespace"
let str s = pstring s .>> ws

let keyword str = pstring str >>? nextCharSatisfiesNot (fun c -> isLetter c || isDigit c)  str

// AST

type Identifier = Identifier of string

// A value is just a literal or a data name, called here "Variable"
type Value =
    | Int of int   | Float of float
    | Bool of bool | String of string
    | Char of char | Variable of Identifier

// All is an instruction, but there are some differences:
type Instr =
    // Arithmetic
    | Literal of Value   | Infix of Instr * InfixOp * Instr
    // Statements (instructions needing another instructions)
    | Let of Identifier * Instr list
    | Loop of Identifier * Instr * Instr * Instr list
    // Other - the "print" function, from the link seen above
    | Print of Identifier
and InfixOp =
    // Arithmetic
    | Sum | Sub | Mul | Div
    // Logic
    | And | Or | Equal | NotEqual | Greater | Smaller | GreaterEqual | SmallerEqual

// Literals

let numberFormat = NumberLiteralOptions.AllowMinusSign   ||| NumberLiteralOptions.AllowFraction |||
                   NumberLiteralOptions.AllowHexadecimal ||| NumberLiteralOptions.AllowOctal    |||
                   NumberLiteralOptions.AllowBinary

let literal_numeric =
    numberLiteral numberFormat "number" |>> fun nl ->
        if nl.IsInteger then Literal (Int(int nl.String))
        else Literal (Float(float nl.String))

let literal_bool = 
    (choice [
        (stringReturn "true" (Literal (Bool true)))
        (stringReturn "false" (Literal (Bool false)))
    ]
    .>> ws)  "boolean"

let literal_string = 
    (between (pstring """) (pstring """) (manyChars (satisfy (fun c -> c <> '"')))
    |>> fun s -> Literal (String s))  "string"

let literal_char = 
    (between (pstring "'") (pstring "'") (satisfy (fun c -> c <> '''))
    |>> fun c -> Literal (Char c))  "character"

let identifier =
    (many1Satisfy2L isLetter (fun c -> isLetter c || isDigit c) "identifier"
    |>> Identifier)  "identifier"

let betweenParentheses p =
    (between (str "(") (str ")") p)  ""

let variable = identifier |>> fun id -> Literal (Variable id)

let literal = (attempt literal_numeric  <|>
               attempt literal_bool     <|>
               attempt literal_char     <|>
               attempt literal_string   <|>
               attempt variable)

// Instressions and statements

let pInstrs, pInstrimpl = createParserForwardedToRef()

// `ploop` is located here to force `pInstrs` to be of the type `Instr list`, `ploop` requesting an instression list.
let ploop =
    pipe4
        (keyword "loop" >>. ws1 >>. identifier)
        (ws1 >>. literal)
        (ws1 >>. literal)
        (pInstrs)
        (fun id min max stmts -> Loop(id, min, max, stmts))

// `singlepInstr` allows to use only one Instression, used just after.
let singlepInstr =
    pInstrs |>> fun ex -> ex.Head

let term =
    (ws >>. singlepInstr .>> ws) <|>
    (betweenParentheses (ws >>. singlepInstr)) <|>
    (ws >>. literal .>> ws) <|>
    (betweenParentheses (ws >>. literal))

let infixOperator (p: OperatorPrecedenceParser<_, _, _>) op prec map =
    p.AddOperator(InfixOperator(op, ws, prec, Associativity.Left, map))

let ops =
    // Arithmetic
    [ "+"; "-"; "*"; "/"; "%" ] @
    // Logical
    [ "&&"; "||"; "=="; "!="; ">"; "<"; ">="; "<=" ]

let opCorrespondance op =
    match op with
    // Arithmetic operators
    | "+"  -> Sum | "-"  -> Sub
    | "*"  -> Mul | "/"  -> Div
    // Logical operators
    | "&&" -> And           | "||" -> Or
    | "==" -> Equal         | "!=" -> NotEqual
    | ">"  -> Greater       | "<"  -> Smaller
    | ">=" -> GreaterEqual  | "<=" -> SmallerEqual
    | _ -> failwith ("Unknown operator: " + op)

let opParser = new OperatorPrecedenceParser()

for op in ops do
    infixOperator opParser op 1 (fun x y -> Infix(x, opCorrespondance op, y))

opParser.TermParser <- term

// Statements

(*
- let:

        let  = 

- print:

        print 

- loop:

        loop    

*)

let plet =
    pipe2
        (keyword "let" >>. ws1 >>. identifier)
        (ws >>. str "=" >>. ws >>. pInstrs)
        (fun id exp -> Let(id, exp))

let print =
    keyword "print" >>. ws1 >>. identifier 
    |>> Print

let instruction =
    print <|> ploop <|> plet <|>

    opParser.ExpressionParser <|>
    literal

pInstrimpl := indentedMany1 instruction "instruction"

let document = pInstrs .>> spaces .>> eof

let test str =
    match runParserOnString document (UserState.Create()) "" str with
        | Success(result, _, _)   -> printfn "%A" result
        | Failure(errorMsg, _, _) -> printfn "%s" errorMsg

System.Console.Clear()

let code = test @"
let foo = a + b
"

I would like to understand first of all why it doesn't work, but also to be able to find a solution to my problem, and that this solution can be extended to the potential syntax additions of the parser.

Awaiting a salutary answer, thank you.

Stephan Tolksdorf · Accepted Answer

In order to understand why your parser doesn't work, you need to isolate the issues.

If I understand you correctly, you want your let parser to support either a single instruction on the same line or indented instructions on subsequent lines, e.g:

let x = instruction
let b =
  instruction
  instruction

If you can't get your existing implementation to work, I'd recommend going back to the implementation on the Wiki and trying to just add support for the let statement.

For example, I made the Wiki parser accept simple let statements with the following modifications:

type Statement = Loop of Identifier * int * int * Statement list
               | Print of Identifier
               | Let of Identifier * Statement list

let ws = skipManySatisfy isBlank
let str s = pstring s .>> ws

let statement, statementRef = createParserForwardedToRef()

let indentedStatements = indentedMany1 statement "statement"

let plet = keyword "let" >>. pipe2 (ws1 >>. identifier)
                                   (ws >>. str "=" >>. ws
                                    >>. (indentedStatements
                                         <|> (statement |>> fun s -> [s])))
                                   (fun id exp -> Let(id, exp))
statementRef := print <|> loop <|> plet

Note that in the modified version statement is now the parser forwarded to a ref cell, not indentedStatements.

Note also that ws is not implemented with spaces, like in your parser. This is important because spaces also consumes newlines, which would prevent the indentedMany1 from seeing the newline and properly calculating the indentation.

The reason your parser produced an "Expecting: newline" error is that indentedMany1 needs a newline at the beginning of the indented sequence in order to be able to calculate the indentation. You would have to modify the implementation of indentedMany1 if you wanted to support e.g. the following indentation pattern:

let x = instruction
        instruction
        instruction

Parsing an ML-like syntax based on indentation, and everything considered to be an instruction/expression

Answers (1)

Related Questions