Reputation: 25
So, I'm continuing my wandering around, and I'm pretty sure I've ended up needing an open-source lexical analyzer for assembler commands (some TinyPG implementation, maybe).
All I want to know is how I can make my app recognize that a given text might be assembler code. For example,
mov ah, 37
should be accepted, while
bad my 42
should not.
Advice on implementing this myself is welcome too, of course, because I'm not sure I would understand "hardcore" implementations.
Upvotes: 1
Views: 4887
Reputation: 26868
The best way to check whether some text might be in a given language is to try to parse it: embed the assembler in your application and invoke it. I strongly recommend that approach. Even for assembly code, the input can contain special syntax or constructs that you haven't thought of, and a hand-written check would then produce a false negative.
This is especially true with assembly code: lexing and parsing it is very cheap compared to other languages, so there's not much harm in doing it twice.
If you try to craft a fancy regex pattern yourself, you'll just end up duplicating the first stages of the assembler anyway, only you'll have to debug it yourself - it's better to go with a complete and tested solution.
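A rough sketch of the "just invoke the assembler" idea, assuming Python and assuming NASM is installed and on the PATH (neither is specified above). The function name `assembles_ok` and the default command line are my own choices for illustration; swap in whatever assembler matches the dialect you expect:

```python
import os
import subprocess
import tempfile


def assembles_ok(source, assembler_cmd=("nasm", "-f", "bin", "-o", os.devnull)):
    """Return True if the external assembler exits with status 0 for `source`.

    `assembler_cmd` is an assumption: any command that takes the input file
    as its last argument and signals errors through its exit code will do.
    A missing executable raises FileNotFoundError rather than returning False.
    """
    # Write the candidate text to a temporary file the assembler can read.
    with tempfile.NamedTemporaryFile("w", suffix=".asm", delete=False) as handle:
        handle.write(source + "\n")
        path = handle.name
    try:
        result = subprocess.run([*assembler_cmd, path], capture_output=True, text=True)
        return result.returncode == 0
    finally:
        os.remove(path)
```

With NASM available, `assembles_ok("mov ah, 37")` should come back true and `assembles_ok("bad my 42")` false, but that depends on the assembler and its default mode, so treat it as something to verify on your setup.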
Upvotes: 3
Reputation: 69260
For a decently accurate identification, checking that each line matches a regex will be okay. That's actually very similar to the first step of a compiler, the scanning phase, where the contents of the file are read and the tokens identified. The next step, the actual parsing, is more complex (although not that complex for assembler).
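To make the scanning phase concrete, here is a minimal sketch in Python (my choice of language, not something the question specifies). The token classes and the tiny instruction set are assumptions for illustration, nowhere near a full x86 lexer:

```python
import re

# Token classes for a tiny, assumed subset of x86 syntax (illustrative only).
# Order matters: earlier alternatives win, and UNKNOWN catches everything else.
TOKEN_SPEC = [
    ("MNEMONIC", r"(?:mov|xor|add|mul|jmp)\b"),
    ("REGISTER", r"(?:[abcd][xhl]|[cd]s)\b"),
    ("NUMBER", r"[0-9][0-9A-Fa-f]*\b"),
    ("COMMA", r","),
    ("SKIP", r"[ \t]+"),
    ("UNKNOWN", r"\S+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)


def scan(line):
    """Split one line into (kind, text) tokens, dropping whitespace."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind = match.lastgroup
        if kind != "SKIP":
            tokens.append((kind, match.group()))
    return tokens


print(scan("mov ah, 37"))
print(scan("bad my 42"))
```

A line that produces any `UNKNOWN` token can be rejected right away; a line without one still has to have its tokens in a sensible order, which is the parser's job.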
An example of a regex would be something like this:
^[ \t]*((mov|xor|add|mul)[ \t]*([abcde][xhl]|[cd]s)[ \t]*,|jmp)[ \t]*([abcde][xhl]|[cd]s|[0-9A-F]+)[ \t]*$
It first matches either one of the two-operand instructions followed by its first operand (a register) and a comma, or the single-operand instruction jmp. After that it requires one more operand, which may be a register or a numeric constant, so a constant is accepted as the second operand of the two-operand instructions and as the jmp target.
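For anyone who wants to try this approach, here is a hedged Python sketch of a line-by-line check against the two examples from the question. It is my own restatement of the idea, not the exact pattern above: the names `LINE_RE` and `might_be_assembly` are made up, the register class is narrowed to `[abcd]`, and whitespace is required after the mnemonic.

```python
import re

# Assumed operand shapes for a tiny x86 subset (illustrative only).
REGISTER = r"(?:[abcd][xhl]|[cd]s)"
NUMBER = r"[0-9A-F]+"

LINE_RE = re.compile(
    rf"""
    ^[ \t]*
    (?:
        (?:mov|xor|add|mul)[ \t]+{REGISTER}[ \t]*,[ \t]*(?:{REGISTER}|{NUMBER})
      |
        jmp[ \t]+(?:{REGISTER}|{NUMBER})
    )
    [ \t]*$
    """,
    re.VERBOSE | re.IGNORECASE,
)


def might_be_assembly(text):
    """True if the text has at least one non-blank line and all of them match."""
    lines = [line for line in text.splitlines() if line.strip()]
    return bool(lines) and all(LINE_RE.match(line) for line in lines)


print(might_be_assembly("mov ah, 37"))  # True
print(might_be_assembly("bad my 42"))   # False
```

Every mnemonic, register, and addressing form you want to accept has to be added by hand, which is the trade-off of the regex route.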
Upvotes: 1