Reputation: 523
I am attempting to use a regex to parse a series of lines of pseudo-assembly code in the following format:
optional_label required_instruction optional_parameter, optional_parameter
An actual example looks a bit more like:
PRINTLOOP MOV R6, R7
CMP R6, R9
TRP 1
BLK
Where MOV, CMP, TRP, BLK and BRZ are instructions.
Whitespace between tokens can be any number of spaces or tabs. Labels must start at the beginning of a line, while instructions can either start at the beginning or be preceded by any number of spaces or tabs.
I need to get at each bit of it individually so it is important that the regex groups it properly. I am currently trying to use this pattern:
((?<label>[\w]*)[ |\t]+)?(?<operator>[\w]+)[ |\t]+(?<operand1>[\w]+)?(,[ |\t]*(?<openparen>\()?(?<operand2>[-]*[\w]+)(?<closeparen>\))?)?
This pattern worked fine until now because there was always at least one parameter, but I now have zero-parameter instructions, which it does not match. I tried tweaking the pattern to the following:
((?<label>[\w]*)[ |\t]+)?(?<operator>[\w]+)([ |\t]+(?<operand1>[\w]+))?(,[ |\t]*(?<openparen>\()?(?<operand2>[-]*[\w]+)(?<closeparen>\))?)?
This makes the whitespace after the instruction (operator) optional, but I found that it made things ambiguous enough that the instruction is mistaken for the label in many lines. For example:
LDB R0, lM
Is understood as label: LDB, Instruction: R0 and neither operand is recognized.
Is there a way to force the operator section to be checked first (so that that part of the string is prioritized)? Failing that, are there resources that explain where I am going wrong, or a regex pattern that does what I am looking for?
Upvotes: 1
Views: 46
Reputation: 726599
Your problem cannot be solved even in theory, because your grammar is ambiguous: when you are looking at
INC R6
your grammar can parse it in the two ways below:
label=INC, Instruction=R6
or
Instruction=INC, Parameter1=R6
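To make the ambiguity concrete, here is a minimal Python sketch. The two patterns are simplified stand-ins written for illustration, not your original pattern, and Python spells named groups as (?P<name>...) rather than (?<name>...). Both patterns accept the same two-token line and disagree about what the tokens mean:

```python
import re

WS = r"[ \t]+"  # one or more spaces or tabs between tokens

# Reading 1: the first token is a label, the second is the instruction.
label_first = re.compile(rf"^(?P<label>\w+){WS}(?P<instruction>\w+)$")

# Reading 2: the first token is the instruction, the second is its parameter.
instruction_first = re.compile(rf"^(?P<instruction>\w+){WS}(?P<parameter1>\w+)$")

line = "INC R6"
reading1 = label_first.match(line).groupdict()
reading2 = instruction_first.match(line).groupdict()

print(reading1)  # {'label': 'INC', 'instruction': 'R6'}
print(reading2)  # {'instruction': 'INC', 'parameter1': 'R6'}
```

Nothing in the text of the line favours one reading over the other, so reordering or prioritising groups inside a single pattern only changes which wrong answer you get for some inputs.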
Assembly languages that I've worked with and/or implemented solve this problem by requiring a colon after the optional label, like this:
[label:] instruction [parameter] [, optional_parameter]
This would give your regex an additional "anchor" (i.e. the colon, :) by which to tell the label+instruction case from the instruction+parameter case.
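As a rough sketch of the colon-terminated label format above, the pattern below is one way it could look in Python. The name parse_line and the simplified operand rules (word characters, an optional leading minus on the second operand, no parentheses) are assumptions made for this example, and it also assumes a label may be preceded by whitespace:

```python
import re

# Verbose-mode pattern: whitespace and comments in the pattern are ignored,
# except inside character classes such as [ \t].
LINE = re.compile(
    r"""^[ \t]*
    (?:(?P<label>\w+):[ \t]*)?               # optional label, ended by a colon
    (?P<instruction>\w+)                     # required instruction
    (?:[ \t]+(?P<operand1>\w+))?             # optional first operand
    (?:[ \t]*,[ \t]*(?P<operand2>-?\w+))?    # optional second operand
    [ \t]*$""",
    re.VERBOSE,
)

def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None

print(parse_line("PRINTLOOP: MOV R6, R7"))
print(parse_line("INC R6"))
print(parse_line("BLK"))
```

Because a label can only match when it is followed by a colon, "INC R6" now has exactly one reading: instruction INC with operand R6 and no label.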
Another alternative is to introduce "keywords" for the instructions and prohibit the use of these keywords as labels. This would let you avoid introducing a colon, but would make a regex-based solution impractical.
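If you go the keyword route, one possible shape is to tokenise the line and decide in ordinary code whether the first token is a label. The sketch below is in Python; the INSTRUCTIONS set and the name parse_with_keywords are hypothetical, and the set only contains the mnemonics that appear in this thread. A regex is used only to split the line into tokens, and the split is deliberately lenient about where commas appear:

```python
import re

# Hypothetical reserved words: only the mnemonics mentioned in this thread.
INSTRUCTIONS = {"MOV", "CMP", "TRP", "BLK", "BRZ", "LDB", "INC"}

def parse_with_keywords(line):
    """Split a line into label, instruction and operands using reserved words."""
    # Tokens are runs of characters that are neither whitespace nor commas.
    tokens = re.findall(r"[^\s,]+", line)
    if not tokens:
        return None  # blank line

    label = None
    if tokens[0] not in INSTRUCTIONS:
        # Not a reserved word, so the first token must be a label.
        label, tokens = tokens[0], tokens[1:]

    if not tokens or tokens[0] not in INSTRUCTIONS:
        raise ValueError(f"no known instruction in line: {line!r}")

    return {"label": label, "instruction": tokens[0], "operands": tokens[1:]}

print(parse_with_keywords("PRINTLOOP MOV R6, R7"))
print(parse_with_keywords("INC R6"))
```

Since INC is reserved, "INC R6" can only be an instruction with one operand, and PRINTLOOP can only be a label because it is not in the set.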
Upvotes: 3