Reputation: 325
I have a problem with showing the most likely constituency structure of some sentence using NLTK's probabilistic grammar.
Here is my sentence: "Ich sah den Tiger unter der Felse"
Here is my code:
import nltk
from nltk import PCFG
tiger_grammar = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> ART NN [0.25] | PPER [0.5] | NP PP [0.25]
VP -> VVFIN NP [0.75] | VVFIN NP PP [0.25]
PP -> APPR NP [1.0]
APPR -> 'unter' [1.0]
PPER -> 'Ich' [1.0]
VVFIN -> 'sah' [1.0]
NN -> 'Tiger' [0.5] | 'Felse' [0.5]
ART -> 'den' [0.5] | 'der' [0.5]
""")
viterbi_parser = nltk.ViterbiParser(tiger_grammar)
trees = viterbi_parser.parse(['Ich', 'sah', 'den', 'Tiger', 'unter', 'der', 'Felse'])
for t in trees:
    print(t)
Here is what I get:
(S
  (NP (PPER Ich))
  (VP
    (VVFIN sah)
    (NP (ART den) (NN Tiger))
    (PP (APPR unter) (NP (ART der) (NN Felse))))) (p=0.000488281)
But the desired result is:
(S
  (NP (PPER Ich))
  (VP
    (VVFIN sah)
    (NP
      (NP (ART den) (NN Tiger))
      (PP (APPR unter) (NP (ART der) (NN Felse))))))
(I didn't add the probability here, but it should be displayed as well)
According to the grammar, the probability of forming VP from VVFIN and NP is higher than from VVFIN, NP and PP. But the parser returns the second structure.
What am I doing wrong?
Would be grateful for suggestions!
Upvotes: 1
Views: 257
Reputation: 1864
That is simply because your desired result has a lower probability than the result you got. We can compute the probability of your desired result by multiplying the probabilities of the rules it uses:
S -> NP VP 1.0
NP -> PPER 0.5
PPER -> Ich 1.0
VP -> VVFIN NP 0.75
VVFIN -> sah 1.0
NP -> NP PP 0.25
NP -> ART NN 0.25
ART -> den 0.5
NN -> Tiger 0.5
PP -> APPR NP 1.0
APPR -> unter 1.0
NP -> ART NN 0.25
ART -> der 0.5
NN -> Felse 0.5
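Multiplied together, these give a probability of 0.0003662109375, which is less than the 0.000488281 of the result you got. The two trees differ only in how the PP attaches: your desired tree pays 0.75 (VP -> VVFIN NP) times 0.25 (NP -> NP PP) = 0.1875, whereas the parser's tree pays 0.25 (VP -> VVFIN NP PP). All other rules are shared, so the parser's tree wins.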
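As a sanity check, the arithmetic can be reproduced without NLTK. The sketch below is only an illustration: the rule probabilities are copied from the grammar in the question, while the nested-tuple encoding of trees and the helper name tree_prob are my own assumptions and not part of NLTK.

```python
# Rule probabilities copied from the grammar in the question.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("ART", "NN")): 0.25,
    ("NP", ("PPER",)): 0.5,
    ("NP", ("NP", "PP")): 0.25,
    ("VP", ("VVFIN", "NP")): 0.75,
    ("VP", ("VVFIN", "NP", "PP")): 0.25,
    ("PP", ("APPR", "NP")): 1.0,
    ("APPR", ("unter",)): 1.0,
    ("PPER", ("Ich",)): 1.0,
    ("VVFIN", ("sah",)): 1.0,
    ("NN", ("Tiger",)): 0.5,
    ("NN", ("Felse",)): 0.5,
    ("ART", ("den",)): 0.5,
    ("ART", ("der",)): 0.5,
}


def tree_prob(tree):
    """Multiply the probabilities of every rule used in a tree.

    A tree is a tuple (label, child, ...); a child is either another
    tree tuple or a plain string (a word). This encoding is hypothetical.
    """
    label, *children = tree
    # Right-hand side of the rule: child labels, or the word itself.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_prob(child)
    return p


# Sub-trees shared by both analyses.
subject = ("NP", ("PPER", "Ich"))
np_tiger = ("NP", ("ART", "den"), ("NN", "Tiger"))
pp = ("PP", ("APPR", "unter"), ("NP", ("ART", "der"), ("NN", "Felse")))

# Tree returned by the parser: PP attached directly under VP.
flat = ("S", subject, ("VP", ("VVFIN", "sah"), np_tiger, pp))
# Desired tree: PP attached inside the object NP.
nested = ("S", subject, ("VP", ("VVFIN", "sah"), ("NP", np_tiger, pp)))

flat_p = tree_prob(flat)
nested_p = tree_prob(nested)
print(flat_p, nested_p, flat_p > nested_p)
```

Here flat_p evaluates to 0.00048828125 and nested_p to 0.0003662109375, so a parser that returns the single most probable tree has to prefer the flat VP under this grammar.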
Upvotes: 1