Constructing custom expression trees while using operators in C#

This question is about constructing custom expression trees in .NET using the operators found in C# (or any other language). I provide the question along with some background information.

For my managed 2-phase 64-bit assembler I need support for expressions. For example, one might want to assemble:

mystring: DB 'hello, world'
          TIMES 64-$+mystring DB ' '

The expression 64-$+mystring must not be a string but an actual valid expression with the benefits of syntax and type checking and IntelliSense in VS, something along the lines of:

64 - Reference.CurrentOffset + new Reference("mystring");

This expression is not evaluated when it is constructed. Instead, it is evaluated later in my assembler's context (when it determines the symbol offsets and such). The .NET framework (since .NET 3.5) provides support for expression trees, and it seems to me that they are ideal for this kind of expression, which is evaluated later or somewhere else.

But I don't know how to ensure that I can use the C# syntax (using +, <<, %, etc..) for constructing the expression tree. I want to prevent things like:

var expression = AssemblerExpression.Subtract(64,
    AssemblerExpression.Add(AssemblerExpression.CurrentOffset(),
        AssemblerExpression.Reference("mystring")))

How would you go about this?

Note: I need an expression tree to be able to convert the expression into an acceptable custom string representation, and at the same time be able to evaluate it at a point in time other than at its definition.

An explanation of my example: 64-$+mystring. The $ is the current offset, so it is a specific number that is unknown in advance (but known at evaluation time). The mystring is a symbol which may or may not be known at evaluation time (for example when it has not yet been defined). Subtracting a constant C from a symbol S is the same as S + -C. Subtracting two symbols S0 and S1 (S1 - S0) gives the integer difference between the two symbols' values.
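The arithmetic rules above can be sketched with operator overloads on already-resolved values. This is only an illustration under assumptions: the SymbolValue type and the offsets used are hypothetical, and the sketch covers immediate evaluation only, not deferred evaluation.

```csharp
using System;

// Hypothetical type: the resolved value of a symbol (an offset in the output).
public struct SymbolValue
{
    public readonly long Offset;
    public SymbolValue(long offset) { Offset = offset; }

    // Symbol + constant is still a symbol-relative value.
    public static SymbolValue operator +(SymbolValue s, long c) { return new SymbolValue(s.Offset + c); }
    public static SymbolValue operator +(long c, SymbolValue s) { return s + c; }

    // S - C is defined as S + (-C).
    public static SymbolValue operator -(SymbolValue s, long c) { return s + (-c); }

    // S1 - S0 is a plain integer distance between the two symbols.
    public static long operator -(SymbolValue s1, SymbolValue s0) { return s1.Offset - s0.Offset; }
}

public static class SymbolArithmeticDemo
{
    public static void Main()
    {
        var mystring = new SymbolValue(0x10); // assumed offset of the label
        var here = new SymbolValue(0x1C);     // assumed value of $ at evaluation time

        long distance = here - mystring;      // symbol - symbol gives an integer
        SymbolValue shifted = mystring - 4;   // symbol - constant gives a symbol
        long times = 64 - (here - mystring);  // 64-$+mystring rewritten as 64-($-mystring)

        Console.WriteLine(distance);          // 12
        Console.WriteLine(shifted.Offset);    // 12
        Console.WriteLine(times);             // 52
    }
}
```

The return types do the type checking: the compiler rejects combinations for which no overload exists, such as adding two symbols.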

However, this question is not really about how to evaluate assembler expressions, but more about how to evaluate any expression that has custom classes in them (for things like the symbols and $ in the example) and how to still ensure that it can be pretty-printed using some visitor (thus keeping the tree). And since the .NET framework has its expression trees and visitors, it would be nice to use those, if possible.

Upvotes: 10

Answers (4)

Ira Baxter

Reputation: 95420

You are implementing a two phase (pass?) assembler? The purpose of a two pass assembler is to handle forward references (e.g., symbols that are undefined when first encountered).

Then you pretty much don't need to build an expression tree.

In phase I (pass 1), you parse the source text (by any means you like: ad hoc parser, recursive descent, parser generator) and collect the values of symbols (in particular, the relative values of labels with respect to the code or data section in which they are contained). If you encounter an expression, you attempt to evaluate it using on-the-fly expression evaluation, typically involving a pushdown stack for subexpressions, and producing a final result. If you encounter a symbol whose value is undefined, you propagate the undefinedness as the expression result. If the assembly operator/command needs the expression value to define a symbol (e.g., X EQU A+2) or to determine offsets into a code/data section (e.g., DS X+23), then the value must be defined or the assembler throws an error. This allows ORG A+B-C to work. Other assembly operators that don't need the value during pass 1 simply ignore the undefined result (e.g., LOAD ABC doesn't care what ABC is, but can determine the length of the LOAD instruction).
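One way to sketch the "propagate undefinedness" idea in C# is with nullable integers, whose lifted operators return null as soon as one operand is null. The names PassOne, Lookup and Require are hypothetical and not part of any assembler described here.

```csharp
using System;
using System.Collections.Generic;

public static class PassOne
{
    // Hypothetical lookup: returns null when the symbol has not been defined yet.
    public static long? Lookup(IDictionary<string, long> symbols, string name)
    {
        long value;
        return symbols.TryGetValue(name, out value) ? value : (long?)null;
    }

    // For operators that need their operand during pass 1 (EQU, ORG, DS, ...).
    public static long Require(long? value, string context)
    {
        if (!value.HasValue)
            throw new InvalidOperationException(context + ": expression must be defined in pass 1");
        return value.Value;
    }

    public static void Main()
    {
        var symbols = new Dictionary<string, long> { { "A", 10 } };

        long? known = Lookup(symbols, "A") + 2;     // 12
        long? unknown = Lookup(symbols, "ABC") + 2; // null: the undefinedness propagates

        Console.WriteLine(known.HasValue);              // True
        Console.WriteLine(unknown.HasValue);            // False
        Console.WriteLine(Require(known, "X EQU A+2")); // 12
    }
}
```

An instruction such as LOAD ABC would simply not call Require in pass 1, so the null result is ignored.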

In phase II (pass 2), you re-parse the code the same way. This time all the symbols have values, so all expressions should evaluate. Those that had to have a value in phase I are checked against the values produced in phase II to ensure they are identical (otherwise you get a PHASE error). Other assembly operators/instructions now have enough information to generate the actual machine instructions or data initializations.
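The phase check described above could look roughly like this, assuming each pass records the values it needed in a dictionary. The PhaseCheck class and its method are hypothetical names chosen for the sketch.

```csharp
using System;
using System.Collections.Generic;

public static class PhaseCheck
{
    // Returns the names whose pass-1 value is missing or different in pass 2.
    public static List<string> FindPhaseErrors(
        IDictionary<string, long> passOne, IDictionary<string, long> passTwo)
    {
        var errors = new List<string>();
        foreach (var entry in passOne)
        {
            long second;
            if (!passTwo.TryGetValue(entry.Key, out second) || second != entry.Value)
                errors.Add(entry.Key);
        }
        return errors;
    }

    public static void Main()
    {
        var passOne = new Dictionary<string, long> { { "X", 12 }, { "Y", 40 } };
        var passTwo = new Dictionary<string, long> { { "X", 12 }, { "Y", 44 } };

        // Y moved between the passes, so it is reported as a PHASE error.
        Console.WriteLine(string.Join(", ", FindPhaseErrors(passOne, passTwo))); // Y
    }
}
```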

The point is, you never have to build an expression tree. You simply evaluate the expression as you encounter it.

If you built a one pass assembler, you might need to model the expression to allow re-evaluation later. I found it easier to produce reverse polish as a sequence of "PUSH value" and arithop commands, and store the sequence (equivalent to the expression tree), because it is dense (trees are not) and trivial to evaluate by doing a linear scan using (as above) a small pushdown stack.
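A minimal sketch of evaluating a stored reverse polish sequence with a small pushdown stack. The string token format is an assumption made for the example (a real assembler would more likely store compact opcodes), and the Rpn class is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class Rpn
{
    // Evaluates tokens such as { "64", "$", "-", "mystring", "+" }.
    // Anything that is neither an operator nor a number is looked up as a symbol.
    public static long Evaluate(IEnumerable<string> tokens, IDictionary<string, long> symbols)
    {
        var stack = new Stack<long>();
        foreach (var token in tokens)
        {
            long number;
            if (token == "+" || token == "-" || token == "*" || token == "/")
            {
                long rhs = stack.Pop(); // operands come off in reverse order
                long lhs = stack.Pop();
                long result;
                switch (token)
                {
                    case "+": result = lhs + rhs; break;
                    case "-": result = lhs - rhs; break;
                    case "*": result = lhs * rhs; break;
                    default:  result = lhs / rhs; break;
                }
                stack.Push(result);
            }
            else if (long.TryParse(token, out number))
            {
                stack.Push(number);
            }
            else
            {
                stack.Push(symbols[token]); // throws if the symbol is still undefined
            }
        }
        return stack.Pop();
    }

    public static void Main()
    {
        var symbols = new Dictionary<string, long> { { "$", 28 }, { "mystring", 16 } };
        // 64-$+mystring in reverse polish
        Console.WriteLine(Evaluate(new[] { "64", "$", "-", "mystring", "+" }, symbols)); // 52
    }
}
```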

In fact what I did was to produce reverse polish that in fact acted as the expression stack itself; during a linear scan, if operands could be evaluated they were replaced by a "PUSH value" command, and the remaining reverse polish was squeezed to remove the bubble. This isn't expensive because most expressions are actually tiny. And it meant that any expression that had to be saved for later evaluation was as small as possible. If you threaded the PUSH identifier commands through the symbol table, then when a symbol becomes defined, you can fill in all the partially evaluated expressions and reevaluate them; the ones that produce a single value are then processed and their space recycled. This allowed me to assemble giant programs in a 4K word, 16 bit machine, back in 1974, because most forward references don't really reach very far.
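The partial evaluation described above might be sketched as follows. This is not the original 1974 implementation: the string tokens, the RpnReducer class and the use of a List as the self-squeezing stack are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RpnReducer
{
    private static bool IsNumber(string s, out long value)
    {
        return long.TryParse(s, NumberStyles.Integer, CultureInfo.InvariantCulture, out value);
    }

    private static long Apply(string op, long lhs, long rhs)
    {
        switch (op)
        {
            case "+": return lhs + rhs;
            case "-": return lhs - rhs;
            case "*": return lhs * rhs;
            default:  return lhs / rhs;
        }
    }

    // Folds every subexpression whose operands are known; the output list
    // doubles as the expression stack, so folded parts simply disappear.
    public static List<string> Reduce(IEnumerable<string> tokens, IDictionary<string, long> symbols)
    {
        var output = new List<string>();
        foreach (var token in tokens)
        {
            long a, b, value;
            if (token == "+" || token == "-" || token == "*" || token == "/")
            {
                int n = output.Count;
                if (n >= 2 && IsNumber(output[n - 2], out a) && IsNumber(output[n - 1], out b))
                {
                    output.RemoveRange(n - 2, 2); // squeeze out the two operands
                    output.Add(Apply(token, a, b).ToString(CultureInfo.InvariantCulture));
                }
                else
                {
                    output.Add(token); // an operand is still unknown: keep the operator
                }
            }
            else if (symbols.TryGetValue(token, out value))
            {
                output.Add(value.ToString(CultureInfo.InvariantCulture));
            }
            else
            {
                output.Add(token); // a literal number or a still-undefined symbol
            }
        }
        return output;
    }

    public static void Main()
    {
        var symbols = new Dictionary<string, long> { { "$", 28 } };
        var pending = Reduce(new[] { "64", "$", "-", "mystring", "+" }, symbols);
        Console.WriteLine(string.Join(" ", pending)); // 36 mystring +

        symbols["mystring"] = 16;                     // the forward reference is resolved
        Console.WriteLine(string.Join(" ", Reduce(pending, symbols))); // 52
    }
}
```

An expression that reduces to a single numeric entry is fully evaluated; anything longer is stored until another of its symbols becomes defined.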

Upvotes: 2

Iridium

Reputation: 23731

Again, not quite sure if this is exactly what you're looking for, but from the starting point of wanting to create some kind of expression tree using C# syntax, I've come up with...

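using System.Collections.Generic;

public abstract class BaseExpression
{
    // Maybe a Compile() method here?
}

// Must be abstract: it declares the abstract members Evaluate and ToString.
public abstract class NumericExpression : BaseExpression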
{
    public static NumericExpression operator +(NumericExpression lhs, NumericExpression rhs)
    {
        return new NumericAddExpression(lhs, rhs);
    }

    public static NumericExpression operator -(NumericExpression lhs, NumericExpression rhs)
    {
        return new NumericSubtractExpression(lhs, rhs);
    }

    public static NumericExpression operator *(NumericExpression lhs, NumericExpression rhs)
    {
        return new NumericMultiplyExpression(lhs, rhs);
    }

    public static NumericExpression operator /(NumericExpression lhs, NumericExpression rhs)
    {
        return new NumericDivideExpression(lhs, rhs);
    }

    public static implicit operator NumericExpression(int value)
    {
        return new NumericConstantExpression(value);
    }

    public abstract int Evaluate(Dictionary<string,int> symbolTable);
    public abstract override string ToString();
}

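public abstract class NumericBinaryExpression : NumericExpression
{
    // Operator symbol used by ToString(); each concrete subclass overrides it.
    protected abstract string Operator { get; }

    protected NumericExpression LHS { get; private set; }
    protected NumericExpression RHS { get; private set; }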

    protected NumericBinaryExpression(NumericExpression lhs, NumericExpression rhs)
    {
        LHS = lhs;
        RHS = rhs;
    }

    public override string ToString()
    {
        return string.Format("{0} {1} {2}", LHS, Operator, RHS);
    }
}

public class NumericAddExpression : NumericBinaryExpression
{
    protected override string Operator { get { return "+"; } }

    public NumericAddExpression(NumericExpression lhs, NumericExpression rhs)
        : base(lhs, rhs)
    {
    }

    public override int Evaluate(Dictionary<string,int> symbolTable)
    {
        return LHS.Evaluate(symbolTable) + RHS.Evaluate(symbolTable);
    }
}

public class NumericSubtractExpression : NumericBinaryExpression
{
    protected override string Operator { get { return "-"; } }

    public NumericSubtractExpression(NumericExpression lhs, NumericExpression rhs)
        : base(lhs, rhs)
    {
    }

    public override int Evaluate(Dictionary<string, int> symbolTable)
    {
        return LHS.Evaluate(symbolTable) - RHS.Evaluate(symbolTable);
    }
}

public class NumericMultiplyExpression : NumericBinaryExpression
{
    protected override string Operator { get { return "*"; } }

    public NumericMultiplyExpression(NumericExpression lhs, NumericExpression rhs)
        : base(lhs, rhs)
    {
    }

    public override int Evaluate(Dictionary<string, int> symbolTable)
    {
        return LHS.Evaluate(symbolTable) * RHS.Evaluate(symbolTable);
    }
}

public class NumericDivideExpression : NumericBinaryExpression
{
    protected override string Operator { get { return "/"; } }

    public NumericDivideExpression(NumericExpression lhs, NumericExpression rhs)
        : base(lhs, rhs)
    {
    }

    public override int Evaluate(Dictionary<string, int> symbolTable)
    {
        return LHS.Evaluate(symbolTable) / RHS.Evaluate(symbolTable);
    }
}

public class NumericReferenceExpression : NumericExpression
{
    public string Symbol { get; private set; }

    public NumericReferenceExpression(string symbol)
    {
        Symbol = symbol;
    }

    public override int Evaluate(Dictionary<string, int> symbolTable)
    {
        return symbolTable[Symbol];
    }

    public override string ToString()
    {
        return string.Format("Ref({0})", Symbol);
    }
}

public class StringConstantExpression : BaseExpression
{
    public string Value { get; private set; }

    public StringConstantExpression(string value)
    {
        Value = value;
    }

    public static implicit operator StringConstantExpression(string value)
    {
        return new StringConstantExpression(value);
    }
}

public class NumericConstantExpression : NumericExpression
{
    public int Value { get; private set; }

    public NumericConstantExpression(int value)
    {
        Value = value;
    }

    public override int Evaluate(Dictionary<string, int> symbolTable)
    {
        return Value;
    }

    public override string ToString()
    {
        return Value.ToString();
    }
}

Now, obviously these classes are only a skeleton (you'd probably want a Compile() method on there, amongst others), not all the operators are implemented, and you could shorten the class names to make things more concise, but it does allow you to do things like:

var result = 100 * new NumericReferenceExpression("Test") + 50;

After which, result will be:

NumericAddExpression
- LHS = NumericMultiplyExpression
        - LHS = NumericConstantExpression(100)
        - RHS = NumericReferenceExpression(Test)
- RHS = NumericConstantExpression(50)

It's not quite perfect - if you use the implicit conversions of numeric values to NumericConstantExpression (instead of explicitly casting/constructing them), then depending on the ordering of your terms, some of the calculations may be performed by the built-in operators, and you'll only get the resulting number (you could just call this a "compile-time optimization"!)

To show what I mean, if you were to instead run this:

var result = 25 * 4 * new NumericReferenceExpression("Test") + 50;

in this case, the 25 * 4 is evaluated using built-in integer operators, so the result is actually identical to the above, rather than building an additional NumericMultiplyExpression with two NumericConstantExpressions (25 and 4) on the LHS and RHS.
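If the folding is unwanted, casting the first operand forces every node into the tree. The sketch below uses a deliberately tiny, hypothetical node hierarchy (Node, Const, Ref, Mul) rather than the classes above, so that it stands alone.

```csharp
using System;

public abstract class Node
{
    public static Node operator *(Node lhs, Node rhs) { return new Mul(lhs, rhs); }
    public static implicit operator Node(int value) { return new Const(value); }
}

public sealed class Const : Node
{
    public readonly int Value;
    public Const(int value) { Value = value; }
    public override string ToString() { return Value.ToString(); }
}

public sealed class Ref : Node
{
    public readonly string Name;
    public Ref(string name) { Name = name; }
    public override string ToString() { return Name; }
}

public sealed class Mul : Node
{
    public readonly Node Lhs, Rhs;
    public Mul(Node lhs, Node rhs) { Lhs = lhs; Rhs = rhs; }
    // Always parenthesized so the tree shape is visible in the output.
    public override string ToString() { return "(" + Lhs + " * " + Rhs + ")"; }
}

public static class FoldingDemo
{
    public static void Main()
    {
        Node folded = 25 * 4 * new Ref("Test");     // int * int is computed first
        Node kept = (Node)25 * 4 * new Ref("Test"); // the cast keeps every node

        Console.WriteLine(folded); // (100 * Test)
        Console.WriteLine(kept);   // ((25 * 4) * Test)
    }
}
```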

These expressions can be printed using ToString() and evaluated, if you provide a symbol table (here just a simple Dictionary<string, int>):

var result = 100 * new NumericReferenceExpression("Test") + 50;
var symbolTable = new Dictionary<string, int>
{
    { "Test", 30 }
};
Console.WriteLine("Pretty printed: {0}", result);
Console.WriteLine("Evaluated: {0}", result.Evaluate(symbolTable));

Results in:

Pretty printed: 100 * Ref(Test) + 50
Evaluated: 3050
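One caveat: ToString() as written never adds parentheses, so a tree built from (a + b) * c would print as a + b * c. A possible fix is to give each node a precedence and parenthesize children that bind less tightly. The sketch below uses a small hypothetical hierarchy (Expr, Constant, Binary) instead of the classes above to stay self-contained.

```csharp
using System;

public abstract class Expr
{
    public abstract int Precedence { get; }
    public static Expr operator +(Expr l, Expr r) { return new Binary("+", 1, l, r); }
    public static Expr operator -(Expr l, Expr r) { return new Binary("-", 1, l, r); }
    public static Expr operator *(Expr l, Expr r) { return new Binary("*", 2, l, r); }
    public static implicit operator Expr(int value) { return new Constant(value); }
}

public sealed class Constant : Expr
{
    private readonly int _value;
    public Constant(int value) { _value = value; }
    public override int Precedence { get { return 3; } } // binds tightest
    public override string ToString() { return _value.ToString(); }
}

public sealed class Binary : Expr
{
    private readonly string _symbol;
    private readonly int _precedence;
    private readonly Expr _left, _right;

    public Binary(string symbol, int precedence, Expr left, Expr right)
    {
        _symbol = symbol; _precedence = precedence; _left = left; _right = right;
    }

    public override int Precedence { get { return _precedence; } }

    public override string ToString()
    {
        // Left child: parenthesize only if it binds less tightly.
        // Right child: also parenthesize on equal precedence (left associativity).
        string left = _left.Precedence < _precedence ? "(" + _left + ")" : _left.ToString();
        string right = _right.Precedence <= _precedence ? "(" + _right + ")" : _right.ToString();
        return left + " " + _symbol + " " + right;
    }
}

public static class PrintDemo
{
    public static void Main()
    {
        Expr one = 1;
        Console.WriteLine((one + 2) * 3);         // (1 + 2) * 3
        Console.WriteLine(one - ((Expr)2 - 3));   // 1 - (2 - 3)
        Console.WriteLine(one + (Expr)2 * 3);     // 1 + 2 * 3
    }
}
```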

Hopefully, despite the drawback(s) mentioned, this is something approaching what you were looking for (or I've just wasted the last half hour!)

Upvotes: 2

sehe

Reputation: 393934

I don't know what exactly you are aiming for, but the following is some sketchy approach that I think would work.

Note that this sketch:

- demonstrates only indexed reference expressions (thus ignoring indirect addressing via registers for now; you could add a RegisterIndirectReference analogous to the SymbolicReference class). The same goes for your suggested $ (current offset) feature, which would probably behave much like a register (?)
- doesn't explicitly show the unary/binary operator- at work either. However, the mechanics are largely the same. I stopped short of adding it because I couldn't work out the semantics of the sample expressions in your question (I'd think that subtracting the address of a known string is not useful, for example)
- does not place (semantic) limits: you can offset any ReferenceBase-derived IReference. In practice, you might only want to allow one level of indexing, and defining the operator+ directly on SymbolicReference would be more appropriate
- has sacrificed coding style for demo purposes: in general, you won't want to repeatedly Compile() your expression trees, and direct evaluation with .Compile()() looks ugly and confusing. It's left up to the OP to integrate it in a more legible fashion
- demonstrates the explicit conversion operator, which is really off-topic. I got carried away slightly (?)

You can observe the code running live on IdeOne.com

using System;
using System.Collections.Generic;
using System.Linq.Expressions;
using System.Linq;


namespace Assembler
{
    internal class State
    {
        public readonly IDictionary<string, ulong> SymbolTable = new Dictionary<string, ulong>();

        public void Clear() 
        {
            SymbolTable.Clear();
        }
    }

    internal interface IReference
    {
        ulong EvalAddress(State s); // evaluate reference to address
    }

    internal abstract class ReferenceBase : IReference
    {
        public static IndexedReference operator+(long directOffset, ReferenceBase baseRef) { return new IndexedReference(baseRef, directOffset); }
        public static IndexedReference operator+(ReferenceBase baseRef, long directOffset) { return new IndexedReference(baseRef, directOffset); }

        public abstract ulong EvalAddress(State s);
    }

    internal class SymbolicReference : ReferenceBase
    {
        public static explicit operator SymbolicReference(string symbol)    { return new SymbolicReference(symbol); }
        public SymbolicReference(string symbol) { _symbol = symbol; }

        private readonly string _symbol;

        public override ulong EvalAddress(State s) 
        {
            return s.SymbolTable[_symbol];
        }

        public override string ToString() { return string.Format("Sym({0})", _symbol); }
    }

    internal class IndexedReference : ReferenceBase
    {
        public IndexedReference(IReference baseRef, long directOffset) 
        {
            _baseRef = baseRef;
            _directOffset = directOffset;
        }

        private readonly IReference _baseRef;
        private readonly long _directOffset;

        public override ulong EvalAddress(State s) 
        {
            return (_directOffset<0)
                ? _baseRef.EvalAddress(s) - (ulong) Math.Abs(_directOffset)
                : _baseRef.EvalAddress(s) + (ulong) Math.Abs(_directOffset);
        }

        public override string ToString() { return string.Format("{0} + {1}", _directOffset, _baseRef); }
    }
}

namespace Program
{
    using Assembler;

    public static class Program
    {
        public static void Main(string[] args)
        {
            var myBaseRef1 = new SymbolicReference("mystring1");

            Expression<Func<IReference>> anyRefExpr = () => 64 + myBaseRef1;
            Console.WriteLine(anyRefExpr);

            var myBaseRef2 = (SymbolicReference) "mystring2"; // uses explicit conversion operator

            Expression<Func<IndexedReference>> indexedRefExpr = () => 64 + myBaseRef2;
            Console.WriteLine(indexedRefExpr);

            Console.WriteLine(Console.Out.NewLine + "=== show compiletime types of returned values:");
            Console.WriteLine("myBaseRef1     -> {0}", myBaseRef1);
            Console.WriteLine("myBaseRef2     -> {0}", myBaseRef2);
            Console.WriteLine("anyRefExpr     -> {0}", anyRefExpr.Compile().Method.ReturnType);
            Console.WriteLine("indexedRefExpr -> {0}", indexedRefExpr.Compile().Method.ReturnType);

            Console.WriteLine(Console.Out.NewLine + "=== show runtime types of returned values:");
            Console.WriteLine("myBaseRef1     -> {0}", myBaseRef1);
            Console.WriteLine("myBaseRef2     -> {0}", myBaseRef2);
            Console.WriteLine("anyRefExpr     -> {0}", anyRefExpr.Compile()());     // compile() returns Func<...>
            Console.WriteLine("indexedRefExpr -> {0}", indexedRefExpr.Compile()());

            Console.WriteLine(Console.Out.NewLine + "=== observe how you could add an evaluation model using some kind of symbol table:");
            var compilerState = new State();
            compilerState.SymbolTable.Add("mystring1", 0xdeadbeef); // raw addresses
            compilerState.SymbolTable.Add("mystring2", 0xfeedface);

            Console.WriteLine("myBaseRef1 evaluates to     0x{0:x8}", myBaseRef1.EvalAddress(compilerState));
            Console.WriteLine("myBaseRef2 evaluates to     0x{0:x8}", myBaseRef2.EvalAddress(compilerState));
            Console.WriteLine("anyRefExpr displays as      {0:x8}",   anyRefExpr.Compile()());
            Console.WriteLine("indexedRefExpr displays as  {0:x8}",   indexedRefExpr.Compile()());
            Console.WriteLine("anyRefExpr evaluates to     0x{0:x8}", anyRefExpr.Compile()().EvalAddress(compilerState));
            Console.WriteLine("indexedRefExpr evaluates to 0x{0:x8}", indexedRefExpr.Compile()().EvalAddress(compilerState));
        }
    }
}
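Regarding the repeated Compile() calls in the demo: a more legible arrangement is to compile once and keep the delegate, while the expression tree itself stays available for printing. A minimal sketch, with a hypothetical CompileOnceDemo class and a simple long-based expression in place of the reference types above:

```csharp
using System;
using System.Linq.Expressions;

public static class CompileOnceDemo
{
    // Builds the tree; the lambda is captured as data, not executed.
    public static Expression<Func<long, long>> BuildOffsetExpression()
    {
        return baseAddress => 64 + baseAddress;
    }

    public static void Main()
    {
        Expression<Func<long, long>> offsetExpr = BuildOffsetExpression();

        // Each Compile() call generates a new delegate, so keep the result around.
        Func<long, long> offset = offsetExpr.Compile();

        Console.WriteLine(offsetExpr);     // printing needs only the tree
        Console.WriteLine(offset(0x1000)); // 4160
        Console.WriteLine(offset(0x2000)); // 8256
    }
}
```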

Upvotes: 5

Iridium

Reputation: 23731

C# supports assigning a lambda expression to an Expression<TDelegate>, which will cause the compiler to emit code to create an expression tree representing the lambda expression, which you can then manipulate. E.g.:

Expression<Func<int, int, int>> times = (a, b) => a * b;

You could then potentially take the generated expression tree and convert it into your assembler's syntax tree, but this doesn't seem to be quite what you're looking for, and I don't think you're going to be able to leverage the C# compiler to do this for arbitrary input.
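As a rough illustration of walking such a compiler-generated tree, the sketch below derives from System.Linq.Expressions.ExpressionVisitor (public since .NET 4.0, so not available in .NET 3.5) and prints the lambda body in prefix form. PrefixPrinter is a hypothetical name, and only binary, parameter and constant nodes are handled.

```csharp
using System;
using System.Linq.Expressions;
using System.Text;

// Hypothetical printer: turns a lambda body into a prefix-style string.
public sealed class PrefixPrinter : ExpressionVisitor
{
    private readonly StringBuilder _out = new StringBuilder();

    public static string Print(LambdaExpression lambda)
    {
        var printer = new PrefixPrinter();
        printer.Visit(lambda.Body);
        return printer._out.ToString();
    }

    protected override Expression VisitBinary(BinaryExpression node)
    {
        // NodeType is an ExpressionType value such as Add or Subtract.
        _out.Append('(').Append(node.NodeType).Append(' ');
        Visit(node.Left);
        _out.Append(' ');
        Visit(node.Right);
        _out.Append(')');
        return node;
    }

    protected override Expression VisitParameter(ParameterExpression node)
    {
        _out.Append(node.Name);
        return node;
    }

    protected override Expression VisitConstant(ConstantExpression node)
    {
        _out.Append(node.Value);
        return node;
    }
}

public static class VisitorDemo
{
    public static void Main()
    {
        // The parameters stand in for $ and mystring; the compiler builds the tree.
        Expression<Func<int, int, int>> expr = (offset, mystring) => 64 - offset + mystring;

        // Output assumes the default unchecked arithmetic context.
        Console.WriteLine(PrefixPrinter.Print(expr)); // (Add (Subtract 64 offset) mystring)
    }
}
```

The same visitor shape could emit an assembler-specific syntax instead of the prefix form shown here.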

You're probably going to end up having to build your own parser for your assembly language, as I don't think the C# compiler is going to do what you want in this case.

Upvotes: 4
