user3160514
user3160514

Reputation:

Good way to create an assembler?

I have an homework to do for my school. The goal is to create a really basic virtual machine as well as a simple assembler. I had no problem creating the virtual machine but I can't think of a 'nice' way to create the assembler.

The grammar of this assembler is really basic: an optional label followed by a colon, then a mnemonic followed by 1, 2 or 3 operands. If there is more than one operand they shall be separated by commas. Also, whitespaces are ignored as long as they don't occur in the middle of a word.

I'm sure I can do this with strtok() and some black magic, but I'd prefer to do it in a 'clean' way. I've heard about Parse Trees/AST, but I don't know how to translate my assembly code into these kinds of structures.

Upvotes: 1

Views: 2675

Answers (6)

Jerry Coffin
Jerry Coffin

Reputation: 490048

At least in my experience, normal lexer/parser generators (e.g., flex, bison/byacc) are all but useless for this task.

When I've done it, nearly the entire thing has been heavily table driven -- typically one table of mnemonics, and for each of those a set of indices into a table of instruction formats, specifying which formats are possible for that instruction. Depending on the situation, it can make sense to do that on a per-operand rather than a per-instruction basis (e.g., for mov instructions that have a fairly large set of possible formats for both the source and the destination).

In a typical case, you'll have to look at the format(s) of the operand(s) to determine the instruction format for a particular instruction. For a fairly typical example, a format of #x might indicate an immediate value, x a direct address, and @x an indirect address. Another common form for an indirect address is (x) or [x], but for your first assembler I'd try to stick to a format that specifies instruction format/addressing mode based only on the first character of the operand, if possible.

Parsing labels is simpler, and (mostly) separate. Basically, each label is just a name with an address.

As an aside, if possible I'd probably follow the typical format of a label ending with a colon (":") instead of a semicolon (";"). Much more often, a semicolon will mark the beginning of a comment.

Upvotes: 0

doynax
doynax

Reputation: 4455

For an assembler there's little need to build an explicit parse tree. Some assemblers do have fancy linkers capable of resolving complicated expressions at link-time time but for a basic assembler an ad-hoc lexer and parsers should do fine.

In essence you write a little lexer which consumes the input file character-by-character and classifies everything into simple tokens, e.g. numbers, labels, opcodes and special characters.

I'd suggest writing a BNF grammar even if you're not using a code generator. This specification may then be translated into a recursive-decent parser almost by-wrote. The parser simply walks through the whole code and emits assembled binary code along the way.

A symbol table registering every label and its value is also needed, traditionally implemented as a hash table. Initially when encountering an unknown label (say for a forward branch) you may not yet know the value however. So it is simply filed away for future reference.

The trick is then to spit out dummy values for labels and expressions the first time around but compute the label addresses as the program counter is incremented, then take a second pass through the entire file to fill in the real values.

For a simple assembler, e.g. no linker or macro facilities and a simple instruction set, you can get by with perhaps a thousand or so lines of code. Much of it brainless through-free hand translation from syntax descriptions and opcode tables.

Oh, and I strongly recommend that you check out the dragon book from your local university library as soon as possible.

Upvotes: 1

Robin Green
Robin Green

Reputation: 33033

I wrote an assembler like this when I was a teenager. You don't need a complicated parser at all.

All you need to do is five steps for each line:

  1. Tokenize (i.e. split the line into tokens). This will give you an array of tokens and then you don't need to worry about the whitespace, because you will have removed it during tokenization.
  2. Initialize some variables representing parts of the line to NULL.
  3. A sequence of if statements to walk over the token array and check which parts of the line are present. If they are present put the token (or a processed version of it) in the corresponding variable, otherwise leave that variable as NULL (i.e. do nothing).
  4. Report any syntax errors (i.e. combinations of types of tokens that are not allowed).
  5. Code generation - I guess you know how to do this part!

Upvotes: 4

mcleod_ideafix
mcleod_ideafix

Reputation: 11418

You can take a look at some already-made assemblers, like PASMO: an assmbler for Z80 CPU, and get ideas from it. Here it is: http://pasmo.speccy.org/

I've written a couple of very simple assemblers, both of them using string manipulation with strtok() and the like. For a simple grammar like the assembly language is, it's enough. Key pieces of my assemblers are:

A symbol table: just an array of structs, with the name of a symbol and its value.

typedef struct
{
    char nombre[256];
    u8 valor;
} TSymbol;

TSymbol tablasim[MAXTABLA];
int maxsim = 0;

A symbol is just a name that have associated a value. This value can be the current position (the address where the next instruction will be assembled), or it can be an explicit value assigned by the EQU pseudoinstruction.

Symbol names in this implementation are limited to 255 characters each, and one source file is limited to MAXTABLA symbols.

I perform two passes to the source code:

The first one is to identify symbols and store them in the symbol table, detecting whether they are followed by an EQU instruction or not. If there is such, the value next to EQU is parsed and assigned to the symbol. In other case, the value of the current position is assigned. To update the current position I have to detect if there is a valid instruction (although I do not assemble it yet) and update it acordingly (this is easy for me because my CPU has a fixed instruction size).

Here you have a sample of my code that is in charge of updating the symbol table with a value from EQU of the current position, and advancing the current position if needed.

case 1:
    if (es_equ (token))
    {
        token = strtok (NULL, "\n");
        tablasim[maxsim].valor = parse_numero (token, &err);
        if (err)
        {
            if (err==1)
                fprintf (stderr, "Error de sintaxis en linea %d\n", nlinea);
            else if (err==2)
                fprintf (stderr, "Simbolo [%s] no encontrado en linea %d\n", token, nlinea);
            estado = 2;
        }
        else
        {
            maxsim++;
            token = NULL;
            estado = 0;
        }
    }
    else 
    {
        tablasim[maxsim].valor = pcounter;
        maxsim++;
        if (es_instruccion (token))
            pcounter++;
        token = NULL;
        estado = 0;
    }
    break;

The second pass is where I actually assemble instructions, replacing a symbol with its value when I find one. It's rather simple, using strtok() to split a line into its components, and using strncasecmp() to compare what I find with instruction mnemonics

Upvotes: 2

src
src

Reputation: 551

If the operands can be expressions, like "1 << (x + 5)", you will need to write a parser. If not, the parser is so simple that you do not need to think in those terms. For each line get the first string (skipping whitespace). Does the string end with a colon? then it is a label, else it is the menmonic. etc.

Upvotes: 1

Tim
Tim

Reputation: 5681

What you're looking for is actually lexical analyses, parsing en finally the generation of the compiled code. There are a lot of frameworks out there which helps creating/generating a parser like Gold Parser or ANTLR. Creating a language definition (and learning how to depending on the framework you use) is most often quite a lot of work.

I think you're best off with implementing the shunting yard algorithm. Which converts your source into a representation computers understand, which makes it easy to understand for your virtual machine.

I also want to say that diving into parsers, abstract syntax trees, all the tools available on the web and reading a lot of papers about this subject is a really good learning experience!

Upvotes: 2

Related Questions