Hyperboreus
Hyperboreus

Reputation: 32459

Determining FIRST sets of CFG

For LR-parsers the FIRST sets are defined as follows (source):

FIRST(A) is the set of terminals which can appear as the first element of any chain of rules matching nonterminal A.

Now given a CFG, that (I) does not allow empty productions (i.e. no rule is of format X → ε) and (II) is proper (i.e. no symbol produces itself), I am trying to determine the FIRST sets.

My reasoning is:

As in the case of X → Yα, I need FIRST(Y) in order to determine FIRST(X), I came up with this recursive approach:

class Rule:
    nextId = 0

    def __init__ (self, left, right):
        self.left = left
        self.right = right
        self.id = Rule.nextId
        Rule.nextId += 1

    def __repr__ (self):
        return 'R{}: {} → {}'.format (self.id, self.left, ' '.join (self.right) )

class Grammar:
    def __init__ (self, rules):
        self.rules = {rule.id: rule for rule in rules}
        self.symbols = set (symbol for rule in rules for symbol in [rule.left] + rule.right)
        self.nonTerminals = set (rule.left for rule in rules)
        self.terminals = self.symbols - self.nonTerminals
        self.calculateFirst ()

    def calculateFirst (self):
        self.first = {}
        for nonTerminal in self.nonTerminals:
            self.first [nonTerminal] = self.getFirst (nonTerminal)

    def getFirst (self, symbol):
        if symbol in self.first: return self.first [symbol]

        first = set ()
        for rule in (rule for rule in self.rules.values () if rule.left == symbol):
            prefix = rule.right [0]
            if prefix == symbol: continue
            if prefix in self.terminals: first.add (prefix)
            else: first |= self.getFirst (prefix)

        return first

#here be dragons
rules = [Rule ('S', ['E'] ), Rule ('E', ['T'] ), Rule ('E', ['(', 'E', ')'] ), Rule ('T', ['n'] ), Rule ('T', ['+', 'n'] ), Rule ('T', ['T', '+', 'n'] ) ]


g = Grammar (rules)
print ('Rules:')
for rule in g.rules.values (): print ('\t{}'.format (rule) )
for nonTerminal in g.nonTerminals:
    print ('First ({}) = {}'.format (nonTerminal, g.first [nonTerminal] ) )

For the grammar given on wikipedia, this yields the following:

Rules:
    R0: S → E
    R1: E → T
    R2: E → ( E )
    R3: T → n
    R4: T → + n
    R5: T → T + n
First (S) = {'+', '(', 'n'}
First (E) = {'+', '(', 'n'}
First (T) = {'+', 'n'}

For another grammar it yields:

Rules:
    R0: S → N
    R1: N → V = E
    R2: N → E
    R3: E → V
    R4: V → x
    R5: V → * E
First (V) = {'*', 'x'}
First (S) = {'*', 'x'}
First (N) = {'*', 'x'}
First (E) = {'*', 'x'}

My questions are:

1. Will this algorithm halt for any given set of rules complying with I and II.

2. Does this algorithm actually produce the correct FIRST sets of all non-terminals for any given set of rules complying with I and II.

3. Is there some clever way to do this?

I thank you in advance for your comments and answers.

Nota bene: I was unsure whether to post this here or on code-review, but as I do not know whether my code works (i.e. yields the expected results) I decided to post it here. If you feel that it belongs rather to CR, please let me know.

Upvotes: 2

Views: 2462

Answers (1)

rici
rici

Reputation: 241931

I think your program will work, in the sense that it will terminate and produce the correct answer if there are no nullable non-terminals. (Self-referencing non-terminals should be fine.)

There is a clever way to do it. Isn't there always? As is often the case, the clever solution was thought up by Robert Tarjan, although there are other clever solutions, too. However, for most grammars, its generally just about as fast to use a simple solution.

Define the relationship A directly-starts-with V where A is a non-terminal and V is any symbol; A directly-starts-with V if there is some production A → V α. For a grammar without nullable non-terminals, FIRST(A) is simply the transitive-closure of the directly-starts-with relationship, range restricted to terminals. Tarjan's algorithm (which computes the strongly connected components of a graph) leads to a fast and elegant solution to this problem.

There are other good algorithms for computing transitive closure. Floyd-Warshall is another popular choice. As you might guess looking at F-W, you can reduce transitive closure to something very similar to matrix multiplication; the same techniques which can be used to reduce the time-complexity of matrix multiplication also apply to transitive closure; however, they're not particularly useful in the real world.

Nullable non-terminals don't complicate the situation very much, as long as you can identify them (and that's really easy). You just extend directly-starts-with to include any symbol on the RHS which is only preceded by nullable non-terminals. The usual algorithm for identifying nullable non-terminals will give you directly-starts-with almost for free.

Upvotes: 2

Related Questions