TheUbMunster
TheUbMunster

Reputation: 55

How to convert a parse tree of one grammar (in CNF form) and apply it to the original, unadulterated grammar

I've written a C# program that takes a list of "Productions" (LHS nonterminal, RHS "recipe"), of a grammar, and applies these transformations on it to reduce the grammar to another grammar that recognizes the same language, but is in Chomsky normal form.

It seems that some algorithms (e.g., CYK) require grammars to be in CNF to operate. My goal is to make a parser that takes a context-free grammar in CNF, turns it into CNF via my program, and then generates a parse tree with some input. However, returning the parse tree "of the form of the CNF-ified" grammar isn't all that useful, so I want to know if there is a general method to construct a parse tree using the definitions of the original grammar.

My original idea was to "flatten" the generated tree and somehow apply the original grammar rules to it, likely involving DFS with some pre/in/post order traversal. I figure that so long as I can guarantee that any particular manipulation of the grammar results in any "original LHS nonterminal" in a recipe directly correlates to the CNF-ified version, that this would be trivial.

e.g.,

"original":
A -> B C D E F
...



"CNF-ified":
A -> B X
X -> C Y
Y -> D Z
Z -> E F
...

(in this example, nonterm "A" are "conceptually the same" in both grammars, so encountering A in the parse tree would lead one to expect B C D E F to follow in the "linearized form")

      A
     / \
    B   X
       / \
      C   Y
         / \
        D   Z
           / \
          E   F

Along the way, I've gone down various routes as I've had ideas to only discover that I've made some incorrect assumptions. Usually this happens when I find out that "X does not guarantee Y".

I feel as though "ambiguity" in regards to grammars make or break this question, as the example:

{} denotes "0 or more of {contents}"

"original":
A -> B { B } { B }

"CNF-ified": (note that BIN and UNIT were omitted for brevity, but they *can* be applied)
A -> B Y Z | B Y | B Z | B
Y -> YY
Y -> B
Z -> ZZ
Z -> B

immediately leads one to ask the question: "for B's 2..n, which belong to the first {} and which belong to the second {}?"

This grammar is ambiguous, but assuming special-case disambiguation rules are present for the original grammar, are there some standard transformations that can be applied to those in order to function with the parser algorithm that acts upon the CNF grammar? And if so, how might this influence the conversion of the CNF parse tree to the original grammar parse tree? This portion of my question may be a bit too open ended, but perhaps my heavy reading of various parsing/(?parser generator?) algorithms, and popular libraries color my view too strongly.

My original motivation for going down this route was discovering that a simple recursive descent LL(k) parser had some infinite recursion issues from the grammar I was operating on, and I felt it was a good (educational) opportunity to try something new. I am not really looking for an answer like "Use antlr/[insert library]". However, if a certain library operates upon a similar line of thinking to me, I would like to learn how the solved these problems.

Note that LL(k) was not sufficient due to rules of this form in the grammar:

A -> B non/terminals...
B -> C non/terminals...
C -> D non/terminals...
D -> A non/terminals...

leading to no tokens being consumed during parse attempt of A. (i.e., expansion of A is "B C D A B C D A B C D..."

It's possible that I can slap a band-aid on the LL(k) parser by massaging the grammar to avoid these issues, but I figure that in addition to this it would be strictly necessary to special-case much of the code, which I would rather avoid, as the grammar is extremely large.

Apparently the grammar I'm using is EBNF, and I handled the extra ({} []) symbols conversion to CNF via the following:

During my observations, it seems that if |[]{} are non-trivially in the recipe, no "one-pass" transformation can "pierce" nested |[]{}, so instead I apply them one at a time, repeatedly, until no more changes are needed.

TransformGrammarVARIABLE() "{}" transforms grammar like this:

original:
A -> 'asd' { B } 'jkl'

transformed:
A -> 'asd' Z 'jkl' | 'asd' 'jkl'
Z -> ZZ
Z -> B

TransformGrammarEITHER() "|" transforms grammar like this:

original:
A -> 'asdf' ( ( B | C ) | ( D | E ) ) 'jkl'

transformed 1:
A -> 'asdf' ( ( B | C ) ) 'jkl'
A -> 'asdf' ( ( D | E ) ) 'jkl'

transformed 2:
A -> 'asdf' B 'jkl'
A -> 'asdf' C 'jkl'
A -> 'asdf' D 'jkl'
A -> 'asdf' E 'jkl'

TransformGrammarOPTIONAL() "[]" transforms grammar like this:

original:
A -> [ B [ C ] ]

transformed 1:
A -> B [ C ]
A -> [ C ]

transformed 2:
A -> BC
A -> B
A -> C
A -> epsilon (to be removed by TransformGrammarDEL())

The algorithm itself is roughly of the form:

bool needPreTransform = true;
while (needPreTransform)
{
    bool t = false;
    t |= TransformGrammarVARIABLE(parsedGrammar);
    t |= TransformGrammarEITHER(parsedGrammar);
    t |= TransformGrammarOPTIONAL(parsedGrammar);
    needPreTransform = t;
}
//at this point there should be no { }, [ ], or |
CNFStartingNonterminal = startingNonterminal = TransformGrammarSTART(parsedGrammar, startingNonterminal);
TransformGrammarTERM(parsedGrammar);
TransformGrammarBIN(parsedGrammar);
TransformGrammarDEL(parsedGrammar);
TransformGrammarUNIT(parsedGrammar);

If anyone can provide me with some insight on this, that would be wonderful, thank you for your time.

Various facts that I've picked up along the way:

Upvotes: 2

Views: 60

Answers (0)

Related Questions