colijuli

Reputation: 11

How can I get a GrammaticalStructure object for a German sentence using the Stanford Parser?

I am using the Stanford Parser (Version 3.5.2) for an NLP application that relies on the analysis of dependency parses as well as information from other sources. So far, I've gotten it to work for English, like so:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.TaggedWord;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;


/**
* Stanford Parser Wrapper (for Stanford Parser Version 3.5.2).
* 
*/

public class StanfordParserWrapper {

    public static void parse(String en, String align, String out) {

        // set up the Stanford parser
        String grammar = "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz";
        String[] options = { "-outputFormat", "wordsAndTags, typedDependencies" };
        LexicalizedParser lp = LexicalizedParser.loadModel(grammar, options);
        TreebankLanguagePack tlp = lp.getOp().langpack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();

        // read document
        Iterable<List<? extends HasWord>> sentences;
        Reader r = new Reader(en);
        String line = null;
        List<List<? extends HasWord>> tmp = new ArrayList<List<? extends HasWord>>();
        while ((line = r.getNext()) != null) {
            Tokenizer<? extends HasWord> token = tlp.getTokenizerFactory()
                .getTokenizer(new StringReader(line));
            List<? extends HasWord> sentence = token.tokenize();
            tmp.add(sentence);
        }
        sentences = tmp;

        Reader alignment = new Reader(align);
        Writer treeWriter = new Writer(out);

        // parse
        long start = System.currentTimeMillis();
        int sentID = 0;
        for (List<? extends HasWord> sentence : sentences) {
            Tree t = new Tree();
            t.setSentID(++sentID);
            System.out.println("parse Sentence " + t.getSentID() + " "
                + sentence + "...");

            edu.stanford.nlp.trees.Tree parse = lp.parse(sentence);

            // ROOT node
            Node root = new Node(true, true);
            t.setNode(root);

            // tagging
            int counter = 0;
            for (TaggedWord tw : parse.taggedYield()) {
                Node n = new Node();
                n.setNodeID(++counter);
                n.setSurface(tw.value());
                n.setTag(tw.tag());
                t.setNode(n);
            }

            t.setSentLength(t.getNodes().size() - 1);

            // labeling
            GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
            List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
            for (TypedDependency td : tdl) {
                Node dep = t.getNodes().get(td.dep().index());
                Node gov = t.getNodes().get(td.gov().index());
                dep.setLabel(td.reln().toString());
                gov.setChild(dep);
                dep.setParent(gov);
            }

            // combine with alignment
            t.initialize(alignment.readNextAlign());
            treeWriter.write(t);
        }
        long stop = System.currentTimeMillis();
        System.err.println("...done! [" + (stop - start) / 1000 + " sec].");

        treeWriter.close();
    }

    public static void main(String[] args) {
        if (args.length == 3) {
            parse(args[0], args[1], args[2]);
        } else {
            System.out.println("Usage: StanfordParserWrapper <input> <alignment> <output>");
        }
    }
}

"Node" and "Tree" are my own classes, not those of the Stanford parser.

My question is this: How can I do the same thing for German? When I replace the English grammar model with "edu/stanford/nlp/models/lexparser/germanPCFG.ser.gz", I get the following exception:

Exception in thread "main" java.lang.UnsupportedOperationException: No GrammaticalStructureFactory defined for edu.stanford.nlp.trees.international.negra.NegraPennLanguagePack
at edu.stanford.nlp.trees.AbstractTreebankLanguagePack.grammaticalStructureFactory(AbstractTreebankLanguagePack.java:591)
at StanfordParserWrapper.parse(StanfordParserWrapper.java:46)
at StanfordParserWrapper.main(StanfordParserWrapper.java:117)
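For reference, the factory lookup that throws can be guarded so the program fails with a clear message instead of a stack trace. This is a minimal sketch against the 3.5.x API, assuming the parser jar and models are on the classpath; it relies on `supportsGrammaticalStructures()` from the `TreebankLanguagePack` interface, which the English pack answers with true and `NegraPennLanguagePack` with false:

```java
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.TreebankLanguagePack;

public class GsfGuard {
    public static void main(String[] args) {
        String grammar = "edu/stanford/nlp/models/lexparser/germanPCFG.ser.gz";
        LexicalizedParser lp = LexicalizedParser.loadModel(grammar);
        TreebankLanguagePack tlp = lp.getOp().langpack();

        // Check before calling grammaticalStructureFactory(), which would
        // otherwise throw UnsupportedOperationException for German
        if (tlp.supportsGrammaticalStructures()) {
            GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
            System.out.println("Dependency conversion available for "
                + tlp.getClass().getSimpleName());
        } else {
            System.err.println("No GrammaticalStructureFactory for "
                + tlp.getClass().getSimpleName()
                + "; dependency output is not supported for this language pack.");
        }
    }
}
```

This doesn't make German dependencies work, but it documents the limitation at the point where it bites.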

The same thing happens with the "germanFactored" model. Obviously, I need to do something different here, as the German model doesn't support GrammaticalStructureFactory. Is there some way to still get a GrammaticalStructure from German text, or do I have to write my code for German completely differently? If so, I'd be grateful for some pointers; I've searched for this quite a bit but couldn't find what I was looking for.

This seems relevant: "How to parse languages other than English with Stanford Parser? in java, not command lines". However, it only tells me that GrammaticalStructureFactory IS supported for the Chinese models, not what I need to do for German parsing.

Thanks a lot,

J

Upvotes: 1

Views: 768

Answers (1)

rec

Reputation: 10895

You don't. The Stanford parser doesn't support dependency analysis (which is what you get from the GrammaticalStructureFactory) for German.

You can try alternative dependency parsers. While Stanford uses a rule-based transformation of the constituent tree into a dependency tree, the alternatives are typically probabilistic.

  • mate-tools has a dependency parser and a model for German
  • you might roll your own with MaltParser (I think there are versions of the TüBa-D/Z corpus that are compatible with MaltParser)
  • or you could look into ParZu (but beware, it's Prolog)

Upvotes: 2
