user2380317

Reputation: 25

Assembly language parser implementation

So, I'm continuing my wandering around, and I'm pretty sure I've ended up needing some open-source lexical analyzer for assembler commands (some TinyPG implementation, maybe).

All I want to know is how I can make my app understand that a given text might be assembler code. For example,

mov ah, 37

should be accepted, while

bad my 42

should not.

Advice on implementing this myself is welcome too, of course, because I'm not sure I would understand "hardcore" implementations.

Upvotes: 1

Views: 4887

Answers (2)

Oak

Reputation: 26868

The best way to check whether some text might be in some language is to try to parse it: embed the assembler in your application and invoke it. I strongly recommend that approach. Even for assembly code, the input can contain some special syntax or construct that you haven't thought of, and a hand-written check will end up emitting a false negative.

This is especially true with assembly code: lexing and parsing it is very cheap compared to other languages, so there's not much harm in doing it twice.

If you try to craft a fancy regex pattern yourself, you'll just end up duplicating the first stages of the assembler anyway, only you'll have to debug it yourself - it's better to go with a complete and tested solution.
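As a rough illustration of "let the assembler decide", here is a minimal Python sketch. It shells out to an installed assembler rather than embedding one, and it assumes NASM (Intel syntax) is on the PATH; the function name `assembles` and the temporary file names are made up for this example. It writes the candidate text to a temporary file, runs the assembler on it, and treats a zero exit status as "accepted".

```python
import os
import shutil
import subprocess
import tempfile
from typing import Optional


def assembles(text: str, assembler: str = "nasm") -> Optional[bool]:
    """Return True/False if the assembler accepts/rejects the text,
    or None when the assembler executable cannot be found."""
    exe = shutil.which(assembler)
    if exe is None:
        return None  # assumption: caller decides what to do without an assembler
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.asm")
        out = os.path.join(tmp, "candidate.bin")
        with open(src, "w", encoding="utf-8") as fh:
            fh.write(text + "\n")
        # -f bin selects flat binary output, -o names the output file.
        # A non-zero exit status means the assembler reported an error.
        result = subprocess.run(
            [exe, "-f", "bin", "-o", out, src],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0


if __name__ == "__main__":
    for sample in ("mov ah, 37", "bad my 42"):
        print(repr(sample), assembles(sample))
```

The `None` result keeps "no assembler available" distinct from "the assembler rejected the input", so a caller can fall back to a cheaper heuristic in that case.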

Upvotes: 3

Anders Abel

Reputation: 69260

For a decently accurate identification, checking that the lines match a regex will be okay. That's actually very similar to the first step of a compiler, the scanning phase, where the contents of the file are read and the tokens identified. The next step, the actual parsing, is more complex (although not that complex for assembler).

An example of a regex would be something like this:

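^[ \t]*((mov|xor|add|mul)[ \t]+([abcd][xhl]|[cd]s)[ \t]*,[ \t]*|jmp[ \t]+)([abcd][xhl]|[cd]s|[0-9A-F]+)[ \t]*$

It first matches either an instruction that takes two parameters (mov, xor, add, mul) followed by its first parameter and a comma, or the single-parameter instruction jmp. Either branch is then followed by one more parameter, which may be a register or a hexadecimal numeric constant; the constant is what makes a value such as 37 valid as the second parameter. The alternation sits inside a group so that the ^ and $ anchors apply to both branches, and at least one space or tab is required after the mnemonic.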
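To show how such a line check might be wired into an application, here is a small Python sketch in the spirit of the regex above, with the alternation grouped so that both anchors apply to every branch. The names `REGISTER`, `NUMBER`, `LINE` and `might_be_assembly` are invented for this example, the case-insensitive flag is an assumption, and the pattern only knows the handful of mnemonics and registers listed, so treat it as a starting point rather than a complete recognizer.

```python
import re

# Building blocks (hypothetical names): a few 16/8-bit registers and a hex number.
REGISTER = r"(?:[abcd][xhl]|[cd]s)"
NUMBER = r"[0-9A-F]+"

# Two-operand instruction with its first operand and comma, or jmp,
# followed in both cases by one final operand.
LINE = re.compile(
    rf"^[ \t]*(?:(?:mov|xor|add|mul)[ \t]+{REGISTER}[ \t]*,[ \t]*|jmp[ \t]+)"
    rf"(?:{REGISTER}|{NUMBER})[ \t]*$",
    re.IGNORECASE,  # assumption: mnemonics and registers may be upper or lower case
)


def might_be_assembly(text: str) -> bool:
    """True if the text has at least one non-blank line and every
    non-blank line matches the pattern."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(LINE.match(ln) for ln in lines)


if __name__ == "__main__":
    for sample in ("mov ah, 37", "bad my 42"):
        print(repr(sample), might_be_assembly(sample))
```

With the two lines from the question, `mov ah, 37` is accepted and `bad my 42` is rejected. Extending the check means growing the mnemonic and register alternatives, which is where this approach starts to duplicate the assembler's own scanner.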

Upvotes: 1
