Jeremy Fisher
Jeremy Fisher

Reputation: 2782

How to convert Earley recognizer to Earley Parser

I have implemented an Earley Recognizer algorithm. I am having trouble figuring out how to get the parse tree from the chart. I have back pointers pointing to the rule that "generated" the rule being added to the chart, but I am taking this quite literally in that, for the current rule R1:

1) if it has a terminal after the dot, check the terminal against the input sentence, add the rule R2 to the next column where R2 is the same as R1 but with the dot shifted. The back pointer of R2 = the back pointer of R1.

2) if there is a non terminal after the dot, add new rules to the current column and the back pointers for each of the new rules point to R1

3) if there is nothing after the dot (completed rule), then R1 is complete, so we scan in the column (less than curr column) associated with R1, find all rules Rj that have the left hand side of R1 after the dot, add Rj to the current column but shift the dot, make the backpointer of Rj point to R1.

I don't think I am getting the right output, so I'm wondering if it's a problem with my logic. What needs to be done to the Earley recognizer to convert it to an Earley parser?

I have a print_parse method which recurses on the back pointers of the rules, but I don't think it produces the correct output. For the sentence

Papa ate the caviar

with (ignoring probabilities) grammar

1   ROOT    S
1   S   NP VP
0.8 NP  Det N
0.1 NP  NP PP
0.7 VP  V NP
0.3 VP  VP PP
1   PP  P NP
0.1 NP  Papa
0.5 N   caviar
0.5 N   spoon
1   V   ate
1   P   with
0.5 Det the
0.5 Det a

it generates:

(ROOT ['S'])(S ['NP', 'VP'])(NP ['Papa'])(S ['NP', 'VP'])(VP ['V', 'NP'])(V ['ate'])(VP ['V', 'NP'])(NP ['Det', 'N'])(Det ['the'])(NP ['Det', 'N'])(N ['caviar'])(ROOT ['S'])

I know the parse chart is correct however as I've checked it against doing the parse table manually (by hand). I know other questions have been asked about this, but they all point to papers and frankly they are difficult. I would really appreciate some help.

Upvotes: 2

Views: 835

Answers (1)

Patrick Huber
Patrick Huber

Reputation: 756

Elizabeth Scott published a paper on how to form parse forests in worst case O(n^3) time. Here is the paper link: http://www.sciencedirect.com/science/article/pii/S1571066108001497/pdf?md5=0cbf8671559f121c5d4e612cc392604f&pid=1-s2.0-S1571066108001497-main.pdf . The integrated algorithm is on page 63 and I briefly describe it below.

I think she found a bug in the back pointer suggested implementation, not sure if you are running into the same issue. It is outlined in section 2 and section 3 starting on page 55. Basically when adding multiple back pointers to a item can fail.

I say parse forest because earley parsers can handle ambiguity very well and the parse forest represents this ambiguity through "or" nodes. A "or" node represents an alternate path in the parse forest.

I implemented it in c# in my earley parsing library here: https://github.com/patrickhuber/pliant

Basically you create four types of nodes:

  1. Intermediate nodes - serve only as pass through nodes in order to keep the tree binary. Without them the tree wouldn't be binary and would have a worse case time complexity greater than O(n^3).
  2. Symbol nodes - are the actual nodes of the parse forest that represent the non terminals in your grammar.
  3. Terminal nodes - are the data found in the input string when running the parse.
  4. Or nodes - represent the ambiguity choices in the grammar.

A rough description of the algorithm:

You need to keep track of the nodes added during the current earley pass. If you encounter the need for the same node, you reuse it from the node sets. You clear this set after each pass. It is best to wrap this in a "Create Node" function and put a method on the node set named something like AddOrGetExistingNode. Scott has a "MAKE_NODE" function in her paper.

During scan, pretty easy, just create/reuse a token node and attach it to the scan earley item.

During a predict, you only create a parse node if the current non-terminal is nullable in the grammar. This is a bug fix found by Aycock and Horspool in their paper "Practical Earley Parsing". In this sitiuation you are doing a completion as well as a prediction. I answer a question on it here: Earley cannot handle epsilon-states already contained in chart and showed some examples.

During a completion you create a parse node for the new state and add the nodes attached to the nodes of the completed item and the predicted item nodes.

When you create a node, if the position of the dotted rule is in the middle of the production you create an intermediate node. If it is at the end of a rule, you create a symbol node.

Upvotes: 2

Related Questions