Determining FIRST sets of CFG

Question

For LR-parsers the FIRST sets are defined as follows (source):

FIRST(A) is the set of terminals which can appear as the first element of any chain of rules matching nonterminal A.

Now given a CFG, that (I) does not allow empty productions (i.e. no rule is of format X → ε) and (II) is proper (i.e. no symbol produces itself), I am trying to determine the FIRST sets.

My reasoning is:

As there are no empty productions, it is sufficient to look at the first symbol of the right hand side of each rule.
For all rules X → tα (being t a terminal and α an arbitrary string of symbols), t is in FIRST(X).
For all rules X → Yα (being Y a non-terminal and α an arbitrary string of symbols), all elements of FIRST(Y) are in FIRST(X).

As in the case of X → Yα, I need FIRST(Y) in order to determine FIRST(X), I came up with this recursive approach:

class Rule:
    nextId = 0

    def __init__ (self, left, right):
        self.left = left
        self.right = right
        self.id = Rule.nextId
        Rule.nextId += 1

    def __repr__ (self):
        return 'R{}: {} → {}'.format (self.id, self.left, ' '.join (self.right) )

class Grammar:
    def __init__ (self, rules):
        self.rules = {rule.id: rule for rule in rules}
        self.symbols = set (symbol for rule in rules for symbol in [rule.left] + rule.right)
        self.nonTerminals = set (rule.left for rule in rules)
        self.terminals = self.symbols - self.nonTerminals
        self.calculateFirst ()

    def calculateFirst (self):
        self.first = {}
        for nonTerminal in self.nonTerminals:
            self.first [nonTerminal] = self.getFirst (nonTerminal)

    def getFirst (self, symbol):
        if symbol in self.first: return self.first [symbol]

        first = set ()
        for rule in (rule for rule in self.rules.values () if rule.left == symbol):
            prefix = rule.right [0]
            if prefix == symbol: continue
            if prefix in self.terminals: first.add (prefix)
            else: first |= self.getFirst (prefix)

        return first

#here be dragons
rules = [Rule ('S', ['E'] ), Rule ('E', ['T'] ), Rule ('E', ['(', 'E', ')'] ), Rule ('T', ['n'] ), Rule ('T', ['+', 'n'] ), Rule ('T', ['T', '+', 'n'] ) ]


g = Grammar (rules)
print ('Rules:')
for rule in g.rules.values (): print ('	{}'.format (rule) )
for nonTerminal in g.nonTerminals:
    print ('First ({}) = {}'.format (nonTerminal, g.first [nonTerminal] ) )

For the grammar given on wikipedia, this yields the following:

Rules:
    R0: S → E
    R1: E → T
    R2: E → ( E )
    R3: T → n
    R4: T → + n
    R5: T → T + n
First (S) = {'+', '(', 'n'}
First (E) = {'+', '(', 'n'}
First (T) = {'+', 'n'}

For another grammar it yields:

Rules:
    R0: S → N
    R1: N → V = E
    R2: N → E
    R3: E → V
    R4: V → x
    R5: V → * E
First (V) = {'*', 'x'}
First (S) = {'*', 'x'}
First (N) = {'*', 'x'}
First (E) = {'*', 'x'}

My questions are:

1. Will this algorithm halt for any given set of rules complying with I and II.

2. Does this algorithm actually produce the correct FIRST sets of all non-terminals for any given set of rules complying with I and II.

3. Is there some clever way to do this?

I thank you in advance for your comments and answers.

Nota bene: I was unsure whether to post this here or on code-review, but as I do not know whether my code works (i.e. yields the expected results) I decided to post it here. If you feel that it belongs rather to CR, please let me know.

Determining FIRST sets of CFG

Answers (1)

Related Questions