Reputation: 872
I have a PENN-Syntax-Tree and I would like to recursively get all rules that this tree contains.
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP ((DT a) (NN talk))))
)
)
my target is to get the grammar rules like:
ROOT --> S
S --> NP VP
NP --> NN
...
As I said I need to do this recursively and without the NLTK Package or any other modules or regular expression. Here's what I have so far. The parameter tree
is a Penn-Tree splitted on each space.
def extract_rules(tree):
tree = tree[1:-1]
print("\n\n")
if len(tree) == 0:
return
root_node = tree[0]
print("Current Root: "+root_node)
remaining_tree = tree[1:]
right_side = []
temp_tree = list(remaining_tree)
print("remaining_tree: ", remaining_tree)
symbol = remaining_tree.pop(0)
print("Symbol: "+symbol)
if symbol not in ["(", ")"]:
print("CASE: No Brackets")
print("Rule: "+root_node+" --> "+str(symbol))
right_side.append(symbol)
elif symbol == "(":
print("CASE: Opening Bracket")
print("Temp Tree: ", temp_tree)
cursubtree_end = bracket_depth(temp_tree)
print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
cursubtree_start = temp_tree.index(symbol)
cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
print("Subtree: ", cursubtree)
rnode = extract_rules(cursubtree)
if rnode:
right_side.append(rnode)
print("Rule: "+root_node+" --> "+str(rnode))
print(right_side)
return root_node
def bracket_depth(tree):
counter = 0
position = 0
subtree = []
for i, char in enumerate(tree):
if char == "(":
counter = counter + 1
if char == ")":
counter = counter - 1
if counter == 0 and i != 0:
counter = i
position = i
break
subtree = tree[0:position+1]
return position
Currently it works for the first subtree of S
but all other subtrees are not getting parsed recursively. Would be glad for any help..
Upvotes: 3
Views: 613
Reputation: 41895
My inclination would be to keep it as simple as possible and not try to reinvent the parsing modules that you're currently not allowed to use. Something like:
string = '''
(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP (DT a) (NN talk)))
)
)
'''
def is_symbol_char(character):
'''
Predicate to test if a character is valid
for use in a symbol, extend as needed.
'''
return character.isalpha() or character in '-=$!?.'
def tokenize(characters):
'''
Process characters into a nested structure. The original string
'(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
'''
tokens = []
while characters:
character = characters.pop(0)
if character.isspace():
pass # nothing to do, ignore it
elif character == '(': # signals start of recursive analysis (push)
characters, result = tokenize(characters)
tokens.append(result)
elif character == ')': # signals end of recursive analysis (pop)
break
elif is_symbol_char(character):
# if it looks like a symbol, collect all
# subsequents symbol characters
symbol = ''
while is_symbol_char(character):
symbol += character
character = characters.pop(0)
# push unused non-symbol character back onto characters
characters.insert(0, character)
tokens.append(symbol)
# Return whatever tokens we collected and any characters left over
return characters, tokens
def extract_rules(tokens):
''' Recursively walk tokenized data extracting rules. '''
head, *tail = tokens
print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])
for token in tail: # recurse
if isinstance(token, list):
extract_rules(token)
characters, tokens = tokenize(list(string))
# After a successful tokenization, all the characters should be consumed
assert not characters, "Didn't consume all the input!"
print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')
extract_rules(tokens[0])
OUTPUT
Tokens:
['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]
Rules:
ROOT --> S
S --> NP VP
NP --> NN DT NN
NN --> Carnac
DT --> the
NN --> Magnificent
VP --> VBD NP
VBD --> gave
NP --> DT NN
DT --> a
NN --> talk
NOTE
I changed your original tree as this clause:
(NP ((DT a) (NN talk)))
seemed incorrect as it was producing an empty node on a syntax tree grapher available on the web so I simplified it to:
(NP (DT a) (NN talk))
Adjust as needed.
Upvotes: 5
Reputation: 402814
This can be done in a much simpler manner. Given we know the structure of our grammar is CNF LR, we can use a recursive regular expression parser to parse the text.
There's a package called pyparser (you can install it with pip install pyparser
if you don't already have it).
from pyparsing import nestedExpr
astring = '''(ROOT
(S
(NP (NN Carnac) (DT the) (NN Magnificent))
(VP (VBD gave) (NP ((DT a) (NN talk))))
)
)'''
expr = nestedExpr('(', ')')
result = expr.parseString(astring).asList()[0]
print(result)
This gives
['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', [['DT', 'a'], ['NN', 'talk']]]]]]
So we've successfully translated our string into a hierarchy of lists. Now we need to write a little code to parse the list and extract rules.
def get_rules(result, rules):
for l in result[1:]:
if isinstance(l, list) and not isinstance(l[0], list):
rules.add((result[0], l[0]))
get_rules(l, rules)
elif isinstance(l[0], list):
rules.add((result[0], tuple([x[0] for x in l])))
else:
rules.add((result[0], l))
return rules
As I mentioned, we already know the structure of our grammar, so we've only to take care of a limited number of conditions here.
Call this function as such:
rules = get_rules(result, set()) # results was obtained from before
for i in rules:
print i
Output:
('ROOT', 'S')
('VP', 'NP')
('DT', 'the')
('NP', 'NN')
('NP', ('DT', 'NN'))
('NP', 'DT')
('S', 'VP')
('VBD', 'gave')
('NN', 'Carnac')
('NN', 'Magnificent')
('S', 'NP')
('VP', 'VBD')
Order this as you need.
Upvotes: 3