Delights

Reputation: 449

Difficulty with my lexical analyzer

I'm trying to program a lexical analyzer to a standard C translation unit, so I've divided the possible tokens into 6 groups; for each group there's a regular expression, which will be converted to a DFA:

  1. Keyword - (will have a symbol table containing "goto", "int"....)

  2. Identifiers - [a-zA-Z][a-zA-Z0-9]*

  3. Numeric Constants - [0-9]+\.?[0-9]*

  4. String Constants - "[EVERY_ASCII_CHARACTER]*"

  5. Special Symbols - (will have a symbol table containing ";", "(", "{"....)

  6. Operators - (will have a symbol table containing "+", "-"....)

My Analyzer's input is a stream of bytes/ASCII characters. My algorithm is the following:

assuming there's a stream of characters, x1...xN
 foreach i = 1; i <= N; i++
    if x1...xi is accepted by one or more of the 6 groups' DFAs
    {
       take the longest token
       add x1...xi to token-linked-list
       delete x1...xi from input
    }

However, this algorithm will treat every single letter it reads as an identifier, since after one character of input the identifiers' DFA ([a-zA-Z][a-zA-Z0-9]*) is already in an accepting state.

Another possible problem is the input "intx;": my algorithm will tokenize this stream into "int", "x", ";", which of course is an error.

I'm trying to think about a new algorithm, but I keep failing. Any suggestions?

Upvotes: 2

Views: 898

Answers (3)

Jonathan Leffler

Reputation: 753525

Code your scanner so that it treats identifiers and keywords the same until the reading is finished.

When you have the complete token, look it up in the keyword table; designate it a keyword if you find it, and an identifier if you don't. This deals with the intx problem immediately: the scanner reads intx, and since that's not a keyword, it must be an identifier.

I note that your identifiers don't allow underscores. That's not necessarily a problem, but many languages do allow underscores in identifiers.

Upvotes: 2

mrjoltcola

Reputation: 20842

It seems you are missing the greediness aspect of competing DFAs. Greedy (leftmost-longest) matching is usually the most useful because it solves the problem of how to choose between competing DFAs. Once you've matched int, the IDENTIFIER DFA has another transition that advances to intx. A finite automaton doesn't exit until it reaches something it can't consume, and if it isn't in a valid accept state at the end of input, or at the point where another DFA is accepting, it is pruned and the other DFA is matched.

Flex, for example, defaults to greedy matching.

In other words, your proposed problem of intx isn't a problem...

If you have 2 rules that compete for int

  1. rule 1 is the token "int"
  2. rule 2 is IDENTIFIER

When we reach

i n t

we don't immediately ACCEPT int because we see another rule (rule 2) where further input x progresses the automaton to a NEXT state:

i n t x

If rule 2 is in an ACCEPT state at that point, then rule 1 is discarded by definition. But if rule 2 is not yet in an ACCEPT state, we must keep rule 1 around while we examine more input, in case rule 2 eventually reaches an ACCEPT state with a match longer than rule 1's. If we then receive some character that matches neither rule, we check whether the rule 2 automaton is in an ACCEPT state for intx; if so, it is the match. If not, it is discarded and the longest previous match (rule 1) is accepted. In this case, however, rule 2 is in an ACCEPT state and matches intx.

In the case that 2 rules reach an ACCEPT or EXIT state simultaneously, precedence is used (the order of the rules in the grammar). Generally you put your keywords first so IDENTIFIER doesn't match first.

Upvotes: 1

Ron Pinkas

Reputation: 367

Tokenizers generally FIRST split the input stream into tokens, based on rules that dictate what constitutes the END of a token, and only later decide what kind of token each one is (or report an error otherwise). Typical end-of-token markers are white space (when not part of a string literal), operators, special delimiters, etc.

Upvotes: 1
