Reputation: 384
I am trying to create a simple programming language from scratch (an interpreter), but I wonder why I should use a lexer. To me, it looks like it would be easier to create a parser that directly parses the code. What am I overlooking?
Upvotes: 8
Views: 2509
Reputation: 4762
Well, for an intelligently simplified programming language you can get away without either a lexer or a parser :-) Not kidding. Look up Forth. You can start with the tags here on SO (gforth is GNU's) and then go to the Standard's site, which has pointers to a few interpreters, sites, and its Glossary.
Then you can check out Win32Forth and that should keep you busy for quite a while :-)
The interpreter also compiles (when you invoke words that switch the system to compilation context). All without a distinct parser. Lookahead is actually lookbehind :-) - not kidding. Only rarely does a word absorb the one following word (i.e. lookahead is at most 1). The "words" (aka tokens) are at the same time keywords and variable names, and they all live in a Dictionary. There's a whole online book at that site (plus a pdf).
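To make the "no parser" claim concrete, here is a minimal sketch in Python (not real Forth, and with a made-up handful of words) of what that word-at-a-time outer interpreter looks like:

```python
# Hypothetical sketch of a Forth-style "outer interpreter" -- not real Forth,
# just the idea: no lexer, no parser, only whitespace-separated words looked
# up in a dictionary and executed one at a time.
stack = []

dictionary = {
    "+":   lambda: stack.append(stack.pop() + stack.pop()),
    "*":   lambda: stack.append(stack.pop() * stack.pop()),
    "dup": lambda: stack.append(stack[-1]),
    ".":   lambda: print(stack.pop()),
}

def interpret(source):
    for word in source.split():          # "tokenizing" is just str.split()
        if word in dictionary:
            dictionary[word]()           # a word is both keyword and name
        else:
            stack.append(int(word))      # anything unknown is a number

interpret("2 3 + dup * .")               # prints 25
```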
Control structures are also just words (they compile a few addresses and jumps on the fly).
You can find old Journals there as well, covering a wide spectrum from machine code generation to object-oriented extensions. Yes, still without a parser - believe it or not.
There used to be more sophisticated (commercial) Forth systems that reduced words to machine call instructions with immediate addressing (which makes the engine run 2-4 times faster), but even plain interpreters were always considered fast. One is apparently still active - SwiftForth - but don't expect any freebies there.
There's one Forth on GitHub, CiForth, which is quite spartan but has builds and releases for Windows, Linux and Mac, 32- and 64-bit, so you can just download and run it. It claims to have a 16-bit build as well :-) For embedded systems, I suppose.
Upvotes: 2
Reputation: 95306
I think you'll agree that most languages (likely including the one you are implementing) have conceptual tokens: identifiers, keywords, numeric and string literals, operators, punctuation, comments, and whitespace.
As a practical matter, it takes a specific chunk of code to scan for/collect the characters that make up each individual token. You'll need such a chunk of code for each type of token your language has.
If you write a parser without a lexer, at each point where your parser is trying to decide what comes next, you'll have to have ALL the code that recognizes the tokens that might occur at that point in the parse. At the next parse point, you'll need all the code to recognize the tokens that are possible there. This gives you an immense amount of code duplication; how many times do you want the code for blanks to occur in your parser?
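To make the duplication concrete, here is a hypothetical Python fragment of a lexer-less parser: every parse point ends up repeating the blank-skipping and number-scanning code inline.

```python
# Hypothetical illustration of the duplication: two parse points, each
# re-implementing whitespace skipping and number scanning inline.
def parse_term(src, i):
    while i < len(src) and src[i] == ' ':      # skip blanks... again
        i += 1
    start = i
    while i < len(src) and src[i].isdigit():   # scan a number... again
        i += 1
    return int(src[start:i]), i

def parse_factor(src, i):
    while i < len(src) and src[i] == ' ':      # same blank-skipping code
        i += 1
    start = i
    while i < len(src) and src[i].isdigit():   # same number-scanning code
        i += 1
    return int(src[start:i]), i
```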
If you think that's not a good way, the obvious cure is to remove all the duplication: place the code for each token in a subroutine for that token, and at each parse point, call the subroutines for the tokens allowed there. At this point, in some sense, you already have a lexer: an isolated collection of code to recognize tokens. You can code perfectly fine recursive descent parsers this way.
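A sketch of that refactoring, with hypothetical recognizer names (skip_blanks, scan_number, scan_identifier); each token kind now has exactly one chunk of code, and the parser just calls the recognizers it allows at each parse point:

```python
# Each token kind gets its own recognizer subroutine.
def skip_blanks(src, i):
    while i < len(src) and src[i].isspace():
        i += 1
    return i

def scan_number(src, i):
    i = skip_blanks(src, i)
    start = i
    while i < len(src) and src[i].isdigit():
        i += 1
    return (int(src[start:i]), i) if i > start else (None, start)

def scan_identifier(src, i):
    i = skip_blanks(src, i)
    start = i
    while i < len(src) and src[i].isalnum():
        i += 1
    return (src[start:i], i) if i > start else (None, start)

# A parse point in a recursive descent parser calls the recognizers it allows:
def parse_primary(src, i):
    value, j = scan_number(src, i)
    if value is not None:
        return ("num", value), j
    name, j = scan_identifier(src, i)
    if name is not None:
        return ("id", name), j
    raise SyntaxError(f"unexpected character at position {i}")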
The next thing you'll discover is that you call the token subroutines for many of the tokens at each parse point. Even that seems like a lot of work and duplication. So, replace all the calls with a single "GetNextToken" call, which itself invokes the token-recognizing code for all tokens and returns an enum that identifies the specific token encountered. Now your parser starts to look reasonable: at each parse point, it makes one call to GetNextToken and then branches on the enum returned. This is basically the interface that people have standardized on as a "lexer".
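A sketch of that interface, using made-up token names; the point is that whitespace handling and token recognition now live in exactly one place, and the parser only ever branches on the enum:

```python
from enum import Enum, auto

class Tok(Enum):
    NUMBER = auto()
    IDENT  = auto()
    PLUS   = auto()
    EOF    = auto()

def get_next_token(src, i):
    while i < len(src) and src[i].isspace():   # blank-skipping lives here, once
        i += 1
    if i >= len(src):
        return Tok.EOF, "", i
    if src[i].isdigit():
        start = i
        while i < len(src) and src[i].isdigit():
            i += 1
        return Tok.NUMBER, src[start:i], i
    if src[i].isalpha():
        start = i
        while i < len(src) and src[i].isalnum():
            i += 1
        return Tok.IDENT, src[start:i], i
    if src[i] == '+':
        return Tok.PLUS, '+', i + 1
    raise SyntaxError(f"bad character {src[i]!r}")

# A parse point becomes: one call, one branch.
kind, text, i = get_next_token("foo + 42", 0)
assert kind is Tok.IDENT and text == "foo"
```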
One thing you will discover is that the token recognizers sometimes have trouble with overlaps; keywords and identifiers usually have this trouble. It is actually easier to merge all the token recognizers into a single finite state machine, which can then distinguish the tokens more easily. This also turns out to be spectacularly fast when processing the programming language source text. Your toy language may never parse more than 100 lines, but real compilers process millions of lines of code a day, and most of that time is spent doing token recognition ("lexing"), especially whitespace suppression.
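One common way a single scanner resolves the keyword/identifier overlap is "maximal munch": scan the longest identifier-shaped lexeme first, then decide with a keyword table. A hypothetical sketch:

```python
# Scan the longest identifier-shaped lexeme, then classify it.
KEYWORDS = {"if", "while", "return"}        # hypothetical keyword set

def scan_word(src, i):
    start = i
    while i < len(src) and (src[i].isalnum() or src[i] == '_'):
        i += 1
    text = src[start:i]
    kind = "KEYWORD" if text in KEYWORDS else "IDENT"
    return kind, text, i

print(scan_word("if", 0))        # ('KEYWORD', 'if', 2)
print(scan_word("iffy", 0))      # ('IDENT', 'iffy', 4) -- longest match wins
```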
You can code this state machine by hand. This isn't hard, but it is rather tedious. Or you can use a tool like FLEX to do it for you; that's just a matter of convenience. As the number of different kinds of tokens in your language grows, the FLEX solution gets more and more attractive.
TL;DR: Your parser is easier to write, and less bulky, if you use a lexer. In addition, if you compile the individual lexemes into a state machine (by hand or using a lexer generator), it will run faster, and that matters.
Upvotes: 8