mcmayer
mcmayer

Reputation: 2023

Use the Haskell monad "do" notation to define a syntax tree

I'm attempting to construct an abstract syntax tree that allows definition using the monad do notation like so:

ast = do
    Variable uint8 "i"
    Function Void "f" $ do
        Variable uint8 "local_y"
        Comment "etc. etc."

The construction I show here was gleaned from Text.Blaze.Html, where it is used to define an HTML tree.

Questions are scattered throughout the following. The main question is how to do this correctly. Any input that helps with understanding this construct is greatly appreciated, of course.

So, first of all, here's a small, flawed, but 'working' example. It's a syntax tree with declarations of variables and functions of certain type, comment lines, and a placeholder declaration that is used for substitutions:

{-# LANGUAGE ExistentialQuantification #-}
module Question
where

import           Control.Applicative
import           Data.Monoid         (Monoid, (<>))
import           Data.String.Utils   (rstrip)

type NumberOfBits = Word
type VariableName = String

data Type = UInt NumberOfBits
          | Int NumberOfBits
          | Void

uint8 = UInt 8
int8 = Int 8

instance Show Type where
    show (UInt w) = "uint" <> show w
    show (Int w)  = "int" <> show w
    show Void     = "void"

data TreeM a = Variable Type VariableName            -- variable declaration
             | Function Type VariableName (TreeM a)  -- function declaration
             | Comment String                        -- a comment
             | PlaceHolder String                    -- a placeholder with                  
             | forall b. Append (TreeM b) (TreeM a)  -- combiner
             | Empty a                               -- needed for what?

type Tree = TreeM ()

subTreeOf :: TreeM a -> a
subTreeOf (Variable _ _)   = undefined
subTreeOf (Function _ _ t) = subTreeOf t
subTreeOf (Comment _)      = undefined
subTreeOf (Empty t)        = t

instance Monoid a => Monoid (TreeM a) where
    mempty = Empty mempty
    mappend = Append
    mconcat = foldr Append mempty

instance Functor TreeM where
    fmap f x = x `Append` (Empty (f (subTreeOf x))) -- fmap :: (a -> b) -> f a -> f b

instance Applicative TreeM where
    pure x = Empty x
    (<*>) x y = (x `Append` y) `Append` (Empty (subTreeOf x (subTreeOf y)))  -- (<*>) :: f (a -> b) -> f a -> f b
    (*>) = Append

instance Monad TreeM where
    return x = Empty x
    (>>) = Append             -- not really needed: (>>) would default to (*>)
    t >>= f = t `Append` (f (subTreeOf t))

indent :: String -> String
indent s = rstrip $ unlines $ map ("    "<>) (lines s)

render :: TreeM a -> String
render (Variable y n)   = "Variable " <> (show y) <> " " <> show n
render (Function r n t) = "Function" <> " " <> n <> " returning " <> (show r) <> ":\n" <> indent (render t)
render (PlaceHolder n)  = "Placeholder \"" <> n <> "\""
render (Append t t')    = (render t) <> "\n" <> (render t')
render (Empty _)        = ""

-- |In input tree t substitute a PlaceHolder of name n' with the Tree t'
sub :: TreeM a -> (String, TreeM a) -> TreeM a
sub t@(PlaceHolder n) (n', t') = if n == n' then t' else t
sub (Function y n t) s         = Function y n (sub t s)
--sub (Append t t') s            = Append (sub t s) (sub t' s)  -- Error!
sub t _                        = t

code :: Tree
code = do
    Variable uint8 "i"
    Variable int8 "j"
    Function Void "f" $ do
        Comment "my function f"
        Variable int8 "i1"
        Variable int8 "i2"
    PlaceHolder "the_rest"

main :: IO ()
main = do
    putStrLn $ render code
    putStrLn "\nNow apply substitution:\n"
    putStrLn $ render (sub code ("the_rest", Comment "There is nothing here"))

This is (should be) a really neat way to define complex tree structures. In particular, this should be the syntactically least noisy, user-friendly, way to define a syntax tree.

In general I struggle to understand the exact meaning of the a in TreeM a. The way I see it the a can be any of the types Variable, Function, PlaceHolder, etc.

I note a few things that strike me as odd:

  1. In forall b. Append (TreeM b) (TreeM a) the order of TreeM a and TreeM b arguments to Append seem to be reversed. In any case, the use of a existential quantifier in a sum type looks strange. If I understand this correctly, it defines a family of constructors for TreeM.
  2. Of all the functions required by Functor, Applicative and Monad the only one actually used is the monad >>. (This indicates that a free monad might be the right tool for this job.) It actually never occurred to me that the do notation employs the >> operator and that this fact can be used.
  3. An undefined has to be used in subTreeOf in order to make the function total.

As already noted, the example above is flawed: Parts of the construct are not appropriate for ASTs:

  1. The definition of Empty makes sense for HTML trees, it is used for empty tags like <br />. But for the AST it makes no sense. It was left as is to keep the Applicative and Functor implementations working.
  2. Likewise, the implementations of Functor and Applicative may make sense for HTML trees, but not for the AST. Even for HTML I don't quite understand the purpose of the fmap and applicative <*>. Both extend the tree by pushing down a node and adding an Empty type. I don't quite see which natural transformation on HTML trees that represents.

I'm surprised the subTreeOf x (subTreeOf y) in the definition of the applicative <*> is actually correct syntax, or is there an implicit >>?

AST transformations

It is natural to apply transformations on ASTs. The PlaceHolder serves as a little toy for applying a transformation. The function sub, having only a partial implementation, here, should substitute the placeholder "the_rest" with a comment. The necessary sub (Append t t') s = Append (sub t s) (sub t' s) does not compile however, the expected type of s is (String, TreeM b), the actual type is (String, TreeM a). Changing the type to sub :: TreeM a -> (String, TreeM b) -> TreeM a, on the other hand, breaks the definition of sub p@(PlaceHolder n) and now I'm stuck.

In fact, isn't this sub exactly what fmap for ASTs should be?

Free monad?

The term 'free monad' regularly pops up when monads for ASTs are discussed. But the free monad relies on the Functor fmap for the free construction and the fmap shown here is not adequate for ASTs. Once the correct fmap is identified the free monad should do the rest - maybe.

fmap

It seems that a correct fmap is the key for success here, and the correct <*> will probably become more obvious.

Use-cases

Loops can be written with forM_, a nice way to build up repetitive parts of ASTs:

forM_ ["you", "get", "the", "idea"] $ \varName -> do
    Variable uint8 varName

Conditional parts can use when, unless, etc.

when hasCppDestructor $ do
    Comment "We need the destructor"
    Function NoReturnType "~SomeClass" $ do
        ...

Semantic analysis, e.g. ensuring correct declaration order, is also possible as was pointed out in the first answer.

Visual clues: Another thing I like is that in the construct shown above control structures such as if-then-else, forM_ etc. start lower-case whereas the AST lines start upper-case.

Background

A few words about where this is heading, possibly: The idea is to use a nice enough embedded DSL that allows to automatically define an AST that rather abstractly represents, say, a convoluted FSM that needs to be implemented in C, C++, Python, Java, Go, Rust, Javascript, whatever... A render function like the one above then maps the certifiably correct AST to the target language.

Updates

Upvotes: 5

Views: 880

Answers (2)

mcmayer
mcmayer

Reputation: 2023

The path I took with this Append encoding of the AST seems to be a dead end, so I dug deeper into free monads. Here's the result:

The free monad is a good fit for this type of problem. Free monads allow to separate the 'logic' of a program from its effects. ASTs fall into this pattern. In this example the 'logic' is the AST and the effect is just pretty printing.

More generally, 'effect' can mean analysis, testing (dry runs, for example), running proofs, pretty printing, compressing, ..., and, of course, actual execution.

A lot has been written about free monads, here are a few useful resources to get started:

Now, using Control.Monad.Free a solution would look like this:

{-# LANGUAGE DeriveFunctor #-}
module Main where

import           Control.Monad.Free
import           Data.Monoid        ((<>))
import           Data.String.Utils  (rstrip)

type NumberOfBits = Word
type VariableName = String

data Type = UInt NumberOfBits
        | Int NumberOfBits
        | Void
        deriving Eq

uint8 = UInt 8
int8 = Int 8

instance Show Type where
    show (UInt w) = "uint" <> show w
    show (Int w)  = "int" <> show w
    show Void     = "void"

data AST n = Variable Type VariableName n                -- variable declaration
            | Function Type VariableName (Free AST ()) n -- function declaration
            | Comment String n                           -- a comment
            | PlaceHolder String n                       -- a placeholder with @name holds holds more code
            | End                                        
            deriving (Eq, Show, Functor)

end :: Free AST ()
end = liftF End -- is exactly Pure ()

variable :: Type -> VariableName -> Free AST ()
variable y n = liftF (Variable y n ())

function :: Type -> VariableName -> Free AST () -> Free AST ()
function t n p = liftF (Function t n p ())

placeHolder :: String -> Free AST ()
placeHolder n = liftF (PlaceHolder n ())

comment :: String -> Free AST ()
comment c = liftF (Comment c ())

indent :: String -> String
indent s = rstrip $ unlines $ map ("    "<>) (lines s)

render :: Free AST r -> String
render (Free (Variable y n next))   = "Variable " <> show y <> " " <> show n <> "\n" <> render next
render (Free (Function t n f next)) = "Function \"" <> n <> "\" returning " <> show t <> ":\n"
                                    <> indent (render f) <> "\n" <> render next
render (Free (Comment c next))      = "//  " <> c <> "\n" <> render next
render (Free (PlaceHolder s next))  = "PlaceHolder \"" <> s <> "\"\n" <> render next
render (Free End)                   = "end"
render (Pure r)                     = "return\n"

code :: Free AST ()
code = do
    placeHolder "includefiles"
    variable uint8 "i"
    variable int8 "j"
    function Void "f" $ do
        comment "This is a function!"
        variable (Int 8) "local_i"

sub :: AST (Free AST b) -> Free AST b
sub (Variable t n next) = do
    variable t n
    next
sub (Function t n f next) = do
    function t n f
    next
sub (Comment c next) = do
    comment c
    next
sub (PlaceHolder s next) = do
    comment "placeholder"
    next

main :: IO ()
main = do
    putStrLn $ render code
    putStrLn "-- Apply subst\n"
    putStrLn $ render (iterM sub code)

Not all of this needs to be spelt out so explicitly. Some of the boilerplate can be removed by using Control.Monad.Free.TH.

Control.Monad.Free is, in a way, the canonical implementation, but the chained data structure means quadratic complexity on some operations. This was addressed by the author himself, Ed Kmett, in Control.Monad.Free.Church, where a different encoding is used. See the free monad benchmark for benchmarks and pointers to other free monad implementations.

Going beyond free monads, cofree monads formalize the interpreter and its relationship with the 'logic'. See, for example, "Free for DSLs, cofree for interpreters" by David Laing.

Upvotes: 1

leftaroundabout
leftaroundabout

Reputation: 120711

I'm not sure if this whole approach is a good idea (though, I've actually though a number of times of doing something similar myself).

Note that monads like Blaze.MarkupM, HaTeX.LaTeXM etc. aren't really much of a monad. They're really just monoids that want access to monadic combinators (mostly, to abuse do notation, but it also allows stack monad transformers on top, which can make quite some sense). I.e., they're nothing but specialised Writer monads!
At the moment, you're really doing the same thing; if that's what you're intending then maybe the best way to go about it would be to just design your type as a Monoid Tree, then look at the structure of the Writer Tree monad and, if desired, refactor it into a TreeM data structure. (HaTeX doesn't do this but keeps LaTeX and LaTeXM separate types with only a common class interface, which is arguably a cleaner approach, though it may be not optimal for performance.)

The result will be pretty like Blaze.MarkupM / the structure you have right now. I could discuss your individual questions, but really, they can all be answered by just looking at how the type is isomorphic to the writer monad.


Actually you don't need a Monad instance at all to use do, as such:

Prelude> 2 * do 1 + 1
4

So if you just want to abuse do to avoid parentheses in a tree layout, but don't really have a sensible way of stashing bindable variables in your structure, consider not writing any monad instance. That instance is only needed for a do block with multiple lines, but if none of these lines bind any variables then you can always just replace the implicit >> with an explicit <>, like

    Function Void "f" $ do
        Variable uint8 "local_y"
     <> Comment "etc. etc."

The only problem really is: these lines can't include the $ operator, because that has lower precedence than <>. One nifty way to circumvent this is to observe that ($) = id, so you can write your example as

ast = do
    Variable uint8 "i"
 <> Function Void "f" `id`do
        Variable uint8 "local_y"
     <> Comment "etc. etc."

Whether this is even more abuse of syntax than defining a not-much-of-a-monad instance is debatable. IMO, if you define such a monad instance you should right away make it a monad transformer, like HaTeX does, because that also gives the option to allow including IO actions in your AST build (for example, to hard-include external source files).


All that said: for your application, it might actually make sense to have a Monad instance that's not just a “sugared up monoid” but actually binds, well, variables in a useful way. That's a feature not applicable to blaze, but certainly to C++/Python/JavaScript language like ASTs, and it could be quite useful because it ensures variables are defined before use, right within the Haskell syntax. Instead of your example, you'd write

ast = do
    i <- variable uint8
    Function Void "f" $ do
        local_y <- variable uint8
        Comment "etc. etc."

The variables would under the hood then actually just be numbered identifiers, chosen according to a state variable.

Implementation would be roughly like this:

type VariableName = Int

data TreeS = Variable Type VariableName
           | Function Type VariableName TreeS
           | Comment String
           | PlaceHolder String
           | Append TreeS TreeS
           | Empty
instance Monoid where (<>) = Append

newtype TreeT m a
    = TreeT { runTreeM :: StateT VariableName (WriterT TreeS m) a }
    deriving (Functor, Applicative, Monad)

variable :: Type -> TreeT m VariableName
variable typ = TreeT $ do
   i <- get
   lift . tell $ Variable typ i
   put $ i+1
   return i

Upvotes: 2

Related Questions