BrockLee

Reputation: 981

How can I reformat the output of the Malt Parser in NLTK?

So I finally figured out how to use the Malt wrapper provided in NLTK from "How to use malt parser in python nltk" and was able to chunk my sentences successfully, but the output comes out in a format I'm unfamiliar with.

For example, parsing "This is a test sentence" returns:

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "This is a test sentence"
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(This (sentence is a test))

Parsing a more complex sentence returns:

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "A ceasefire for east Ukraine has been agreed during talks in Minsk."
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(agreed
   (ceasefire A (for (Ukraine east)))
   has
   been
   (during (talks (in Minsk)))
   .)

Could someone please explain what this output format is, or how I can transform it so that it follows the word order of the original sentence:

(This (is a test sentence))
A (ceasefire (for (east Ukraine))) has been (agreed (during (talks (in Minsk))).)

If it helps, graph is an nltk DependencyGraph and graph.tree() is an nltk Tree.

Thanks in advance.

Upvotes: 0

Views: 590

Answers (1)

Linguist

Reputation: 123

MaltParser is a system for data-driven "dependency parsing", which can be used to induce a parsing model from treebank data and to parse new data using an induced model.

The files engmalt.poly-1.7.mco and engmalt.linear-1.7.mco contain single malt configurations for parsing English text with MaltParser.

The two models differ in that engmalt.poly-1.7.mco uses SVMs with a polynomial kernel for classification, while engmalt.linear-1.7.mco uses linear SVMs. The latter parser is much faster, whereas the former requires less memory; parsing accuracy is similar for the two models. The two models also differ in how the parsed output is written.

With engmalt.poly-1.7.mco, the parsed text is represented as a dependency annotation (a dependency graph), whereas engmalt.linear-1.7.mco represents it in a linear way.

Please see the outputs below. Hope this helps.

With mco="engmalt.linear-1.7"

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "This is a test sentence"
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(This (sentence is a test))

With mco="engmalt.poly-1.7"

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.poly-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "This is a test sentence"
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(is This (a (sentence test)))
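
In case it helps with reading these bracketings: as far as I can tell, DependencyGraph.tree() uses the head word as the label of each subtree and its dependents as the children, so the word right after each "(" is the head and everything else inside that pair of parentheses depends on it. Below is a minimal sketch that lists the (head, dependent) pairs from such a string. arcs_from_bracketing is a made-up helper, not part of NLTK, and it assumes that no token itself contains a parenthesis.

```python
def arcs_from_bracketing(text):
    """List (head, dependent) pairs from a head-first bracketing.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    arcs = []
    stack = []           # heads of the subtrees that are currently open
    expect_head = False  # True right after an opening parenthesis
    for tok in tokens:
        if tok == "(":
            expect_head = True
        elif tok == ")":
            stack.pop()
        elif expect_head:
            # First word after "(" heads this subtree and, unless it is
            # the root, depends on the enclosing head.
            if stack:
                arcs.append((stack[-1], tok))
            stack.append(tok)
            expect_head = False
        else:
            # Any other word is a leaf dependent of the current head.
            arcs.append((stack[-1], tok))
    return arcs


print(arcs_from_bracketing("(is This (a (sentence test)))"))
# [('is', 'This'), ('is', 'a'), ('a', 'sentence'), ('sentence', 'test')]
```

So in the poly output above, "is" is the root, with "This" and "a" attached to it.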

For your more complex sentence, with mco="engmalt.linear-1.7"

>>> import nltk
>>> parser = nltk.parse.malt.MaltParser(working_dir="/path/to/dir",mco="engmalt.linear-1.7",additional_java_args=['-Xmx512m'])
>>> txt = "A ceasefire for east Ukraine has been agreed during talks in Minsk."
>>> graph = parser.raw_parse(txt)
>>> graph.tree().pprint()
(A
  (agreed
    (been ceasefire for east Ukraine has)
    (during (Minsk talks in)))
  .)
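
If what you want is a bracketing that follows the original word order, the Tree by itself is probably not enough, because it only keeps the head-first nesting; the word positions are kept in the DependencyGraph. Here is a rough sketch, assuming you have already pulled (word, head position) pairs out of the graph's nodes (the 'word' and 'head' fields; how you reach the nodes depends on your NLTK version, so that step is left out here). bracket_in_surface_order is a hypothetical helper, not an NLTK function.

```python
def bracket_in_surface_order(tokens):
    """Bracket a dependency parse while keeping the sentence's word order.

    tokens: list of (word, head) pairs in sentence order. Positions are
    1-based and head 0 marks the root. Hypothetical helper, not NLTK API.
    """
    # Map each head position to the positions of its dependents.
    children = {}
    for pos, (word, head) in enumerate(tokens, start=1):
        children.setdefault(head, []).append(pos)

    def render(pos):
        word = tokens[pos - 1][0]
        deps = children.get(pos, [])
        if not deps:
            return word
        # Place the head among its dependents according to position.
        parts = sorted(deps + [pos])
        inner = " ".join(word if p == pos else render(p) for p in parts)
        return "(" + inner + ")"

    return " ".join(render(root) for root in children.get(0, []))


# Heads read off the linear-model output "(This (sentence is a test))".
parse = [("This", 0), ("is", 5), ("a", 5), ("test", 5), ("sentence", 1)]
print(bracket_in_surface_order(parse))
# (This (is a test sentence))
```

One caveat: if a subtree is not contiguous in the sentence (a non-projective parse), the brackets still nest correctly, but the words will not come out in exactly the original order.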

Upvotes: 1
