Uran

Reputation: 141

Lexical Analysis of Preprocessed Code

I have programmed an assembler with a preprocessor for the MOS 6502 microprocessor. The assembler spits out the correct binary and the preprocessor performs constant substitution, inclusions and conditional inclusions. The problem is retaining file positions of the included files. At this point the preprocessor emits a file directive just before and after a file is included. Here is an example.

Proggie.asm

JSR init
JSR loop
JSR end

%include "Init.asm"

%include "Loop.asm"

%include "End.asm"

Init.asm

init:
    LDX #$00
    RTS

Loop.asm

loop:
    INX
    CPX #$05
    BNE loop
    RTS

End.asm

end:
    BRK

Preprocessor Result

%file "D:\Proggie.asm" 1
    JSR init
    JSR loop
    JSR end

%file "D:\Init.asm" 1
init:
    LDX #$00
    RTS%file "D:\Init.asm" 2

%file "D:\Loop.asm" 1
loop:
    INX
    CPX #$05
    BNE loop
    RTS%file "D:\Loop.asm" 2

%file "D:\End.asm" 1
end:
    BRK%file "D:\End.asm" 2
%file "D:\Proggie.asm" 2

This idea comes from the output of GCC's preprocessor. The %file directive tells the lexical analyzer that a file has just been entered or exited: the number after the file path is 1 when the analyzer enters the given file and 2 when it leaves it. My lexical analyzer more or less works with this, but it is still a bit off when reporting the current line number.

So my question is: Is this the way to go? Or is there another algorithm I could use?

Upvotes: 0

Views: 85

Answers (1)

rici

Reputation: 241731

Gcc's preprocessor fabricates line control directives which look like this:

# 122 "/usr/include/x86_64-linux-gnu/bits/types.h" 2 3 4

Here, the 122 is the line number in the file /usr/include/x86_64-linux-gnu/bits/types.h. Including the line number means that a downstream lexer doesn't need to track the include stack in order to tell which line it is on.
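To illustrate why the line number helps, here is a minimal sketch of a downstream line tracker. It assumes a hypothetical variant of the question's directive, `%file "path" <line>`, where the number is the line number of the text that follows rather than an enter/exit flag; the function name `annotate` is made up for this example.

```python
import re

# Hypothetical marker format (an assumption, not the asker's current one):
#   %file "path" <line>
# where <line> is the line number of the text that follows the marker.
MARKER = re.compile(r'^%file\s+"([^"]*)"\s+(\d+)\s*$')

def annotate(lines):
    """Yield (path, line_number, text) for every non-marker line."""
    path, lineno = "<unknown>", 1
    for text in lines:
        m = MARKER.match(text)
        if m:
            # A marker only resets the position; it is not a source line.
            path, lineno = m.group(1), int(m.group(2))
            continue
        yield path, lineno, text
        lineno += 1
```

The tracker keeps only the current file and line, with no include stack: every marker overwrites both, and every ordinary line increments the counter.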

The rest of the line consists of flags, which are similar to your approach with the addition of a couple of gcc-specific flags:

  • '1' This indicates the start of a new file.
  • '2' This indicates returning to a file (after having included another file).
  • '3' This indicates that the following text comes from a system header file, so certain warnings should be suppressed.
  • '4' This indicates that the following text should be treated as being wrapped in an implicit 'extern "C"' block.

These allow the downstream lexer to track the include stack if it wishes, and the gcc lexer does so in order to produce more informative (or at least more wordy) error messages.
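As a rough sketch of how a downstream lexer could use flags 1 and 2 to rebuild the include stack, the following parses gcc-style linemarkers and records, for each ordinary line, the chain of files it was included from. The function name `included_from` and the handling of markers without flags are assumptions made for this example, not a description of how gcc itself does it.

```python
import re

# Matches lines such as:  # 122 "/usr/include/foo.h" 2 3 4
LINEMARKER = re.compile(r'^#\s+(\d+)\s+"([^"]*)"((?:\s+\d)*)\s*$')

def included_from(lines):
    """Return (file, line, chain) per ordinary line; chain lists the
    including files as (file, line) pairs, outermost first."""
    stack = []  # [file, line] entries, innermost file last
    out = []
    for text in lines:
        m = LINEMARKER.match(text)
        if m:
            line, path, flags = int(m.group(1)), m.group(2), m.group(3).split()
            if "1" in flags:
                # Entering a new file: keep the includer below it.
                stack.append([path, line])
            elif "2" in flags and len(stack) > 1:
                # Returning to the includer: drop the finished file.
                stack.pop()
                stack[-1] = [path, line]
            elif stack:
                # No enter/return flag: just reposition the current file.
                stack[-1] = [path, line]
            else:
                stack.append([path, line])
            continue
        if not stack:
            stack.append(["<unknown>", 1])
        path, line = stack[-1]
        out.append((path, line, [(f, n) for f, n in stack[:-1]]))
        stack[-1][1] += 1
    return out
```

Because the include directive itself does not appear in the preprocessed text, the includer's counter is left pointing at the line of the directive, which is what an "included from" note would want to report.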

I think the logic is easier with the preprocessor maintaining the stack, but it doesn't make a huge amount of difference, particularly if you're also going to want to generate "included from" notes in your error messages.
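A minimal sketch of the preprocessor-side approach, assuming the same hypothetical `%file "path" <line>` marker with a line number in place of the enter/exit flag. Here the include stack is simply the recursion of the hypothetical `preprocess` function, and files are passed in as an in-memory dictionary to keep the example self-contained. There is no guard against recursive includes.

```python
import re

INCLUDE = re.compile(r'^\s*%include\s+"([^"]*)"\s*$')

def preprocess(name, files, out=None):
    """Expand %include directives in files[name], emitting a marker that
    carries the line number of the text following it."""
    if out is None:
        out = []
    out.append(f'%file "{name}" 1')
    for lineno, text in enumerate(files[name].splitlines(), start=1):
        m = INCLUDE.match(text)
        if m:
            # The recursive call acts as the include stack.
            preprocess(m.group(1), files, out)
            # Back in the including file: the next line is lineno + 1.
            out.append(f'%file "{name}" {lineno + 1}')
        else:
            out.append(text)
    return out
```

Each marker is emitted as its own output line, so the lexer never has to split a marker off the end of an instruction, and it only needs to reset its position when it sees one.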

Upvotes: 1
