stygma

Reputation: 523

matching parts of reg

I am attempting to parse, with a regex, a series of lines of pseudo-assembly code in the following format:

optional_label  required_instruction    optional_parameter, optional_parameter

An actual example looks a bit more like:

PRINTLOOP   MOV R6, R7
CMP R6, R9
TRP 1
BLK

Here MOV, CMP, BLK and BRZ are instructions.

Whitespace between tokens can be any number of spaces or tabs. Labels must start at the beginning of a line, while instructions can either start at the beginning or follow any amount of leading spaces or tabs.

I need to get at each bit of it individually so it is important that the regex groups it properly. I am currently trying to use this pattern:

    ((?<label>[\w]*)[ |\t]+)?(?<operator>[\w]+)[ |\t]+(?<operand1>[\w]+)?(,[ |\t]*(?<openparen>\()?(?<operand2>[-]*[\w]+)(?<closeparen>\))?)?

This pattern has worked fine until now because there was always at least one parameter, but now I have zero-parameter instructions, which don't fit it. I tried to tweak the pattern to be the following:

    ((?<label>[\w]*)[ |\t]+)?(?<operator>[\w]+)([ |\t]+(?<operand1>[\w]+))?(,[ |\t]*(?<openparen>\()?(?<operand2>[-]*[\w]+)(?<closeparen>\))?)?

This makes the whitespace after the instruction (operator) optional, but I found that it made things ambiguous enough that the instruction is taken to be the label in many lines. For example:

    LDB     R0,      lM

Is understood as label: LDB, Instruction: R0 and neither operand is recognized.

Is there a way to force the operator section to be checked first (so that that part of the string is prioritized)? Failing that, are there resources that explain where I am going wrong, or a regex pattern that does what I am looking for?

Upvotes: 1

Views: 46

Answers (1)

Sergey Kalinichenko

Reputation: 726599

Your problem cannot be solved even in theory, because your grammar is ambiguous: when you are looking at

INC R6

your grammar can parse it in the two ways below:

label=INC, Instruction=R6

or

Instruction=INC, Parameter1=R6

Assembly languages that I've worked with and/or implemented solve this problem by requiring a colon after the optional label, like this:

[label:]  instruction [parameter] [, optional_parameter]

This would give your regex an additional "anchor" (i.e. the colon :) by which to tell the label+instruction case apart from the instruction+parameter case.
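As a rough illustration of the colon-anchored format, here is a minimal Python sketch. It is not the asker's pattern: the function name `parse_colon_line` is made up for this example, Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, and the optional parentheses around the second operand are left out for brevity.

```python
import re

# Hypothetical pattern: a label, when present, must be followed by ':'.
LINE_RE = re.compile(
    r"""
    ^(?:(?P<label>\w+):)?                   # optional label ending in a colon
    [ \t]*(?P<operator>\w+)                 # required instruction
    (?:[ \t]+(?P<operand1>\w+))?            # optional first operand
    (?:[ \t]*,[ \t]*(?P<operand2>-?\w+))?   # optional second operand
    [ \t]*$
    """,
    re.VERBOSE,
)

def parse_colon_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

for text in ["PRINTLOOP:   MOV R6, R7", "    LDB     R0,      lM", "INC R6", "BLK"]:
    print(parse_colon_line(text))
```

Because a label can only match when the colon follows it, `INC R6` can no longer be read as label plus instruction, and zero-parameter lines such as `BLK` still match.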

Another alternative is to introduce "keywords" for the instructions and prohibit the use of these keywords as labels. This would let you avoid introducing a colon, but would make a regex-based solution impractical.
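The keyword approach might look something like the sketch below, which tokenizes the line and decides label vs. instruction by looking the first token up in a set. The `MNEMONICS` set is an assumption assembled from the instructions mentioned in this thread, and `parse_keyword_line` is a hypothetical name; a real assembler would use its full instruction table.

```python
import re

# Assumed instruction set: only the mnemonics that appear in this thread.
MNEMONICS = {"MOV", "CMP", "TRP", "BLK", "BRZ", "LDB", "INC"}

def parse_keyword_line(line):
    """Split a line into label, operator and operands using a keyword table."""
    # Split on runs of spaces, tabs and commas; drop empty pieces.
    tokens = [t for t in re.split(r"[ \t,]+", line.strip()) if t]
    if not tokens:
        return None  # blank line

    label = None
    if tokens[0].upper() not in MNEMONICS:
        # Not a keyword, so it must be a label, and labels start in column 0.
        if line[0] in " \t":
            raise ValueError(f"unknown instruction: {tokens[0]!r}")
        label = tokens.pop(0)

    if not tokens or tokens[0].upper() not in MNEMONICS:
        raise ValueError(f"missing or unknown instruction in: {line!r}")

    return {"label": label, "operator": tokens[0], "operands": tokens[1:]}

for text in ["PRINTLOOP   MOV R6, R7", "    LDB     R0,      lM", "INC R6", "BLK"]:
    print(parse_keyword_line(text))
```

Splitting on commas and whitespace together is a simplification: it does not check that operands are actually comma-separated, which a stricter parser would enforce.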

Upvotes: 3
