Reputation: 7732
Stanford Core NLP parser generates the following output for the sentence:
"He didn't get a reply"
(ROOT
(S
(NP (PRP He))
(VP (VBD did) (RB n't)
(VP (VB get)
(NP (DT a) (NN reply))))
(. .)))
I need a way to navigate it, i.e. extract tags easily and find children and parents. Currently I am doing it manually (counting parentheses). I wonder if there is a Python library that can do the parenthesis counting for me, or better yet, something like Beautiful Soup or Scrapy that would let me work with objects.
If there is no such tool, what is the best way to traverse a sentence and get all tags? I am guessing I would need to create some sort of tag object with a list containing children tag objects.
Upvotes: 1
Views: 1646
Reputation: 21
I have used this Python script to solve the problem successfully.
The script converts Stanford CoreNLP's Lisp-like parse tree format into a nested Python list structure.
You can then convert the nested list further with something like anytree into a more navigable Python data structure, which also lets you print the tree as text or as an image.
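The linked script is not reproduced here, but the conversion it performs can be sketched in a few lines (an illustrative stand-in, not the script itself): tokenize on the parentheses, then build lists recursively.

```python
def parse_sexpr(s):
    """Turn CoreNLP's Lisp-like bracketing into nested Python lists."""
    tokens = s.replace('(', ' ( ').replace(')', ' ) ').split()

    def read(i):
        items = []
        while i < len(tokens):
            tok = tokens[i]
            if tok == '(':
                sub, i = read(i + 1)
                items.append(sub)
            elif tok == ')':
                return items, i + 1
            else:
                items.append(tok)
                i += 1
        return items, i

    return read(0)[0]

tree = parse_sexpr("(ROOT (S (NP (PRP He)) (VP (VBD did) (RB n't) "
                   "(VP (VB get) (NP (DT a) (NN reply)))) (. .)))")[0]
print(tree[0])     # the root label, 'ROOT'
print(tree[1][1])  # ['NP', ['PRP', 'He']]
```

From the nested lists, the anytree step is mechanical: wrap each list's first element (the tag) in a Node and attach the remaining elements as children.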
Upvotes: 2
Reputation: 59
My approach to navigating the output was not to parse the string, but to build an object model and deserialize into it. You then have the output available as native objects.
The output shown in the question is produced by a pipeline option called "prettyPrint". I changed that to "jsonPrint" to get JSON output instead. I then took the output and generated classes from it (Visual Studio has a Paste Special option to generate a class from JSON, and there are online resources like http://json2csharp.com/). The resulting classes looked like this:
public class BasicDependency
{
    public string dep { get; set; }
    public int governor { get; set; }
    public string governorGloss { get; set; }
    public int dependent { get; set; }
    public string dependentGloss { get; set; }
}

public class EnhancedDependency
{
    public string dep { get; set; }
    public int governor { get; set; }
    public string governorGloss { get; set; }
    public int dependent { get; set; }
    public string dependentGloss { get; set; }
}

public class EnhancedPlusPlusDependency
{
    public string dep { get; set; }
    public int governor { get; set; }
    public string governorGloss { get; set; }
    public int dependent { get; set; }
    public string dependentGloss { get; set; }
}

public class Token
{
    public int index { get; set; }
    public string word { get; set; }
    public string originalText { get; set; }
    public string lemma { get; set; }
    public int characterOffsetBegin { get; set; }
    public int characterOffsetEnd { get; set; }
    public string pos { get; set; }
    public string ner { get; set; }
    public string speaker { get; set; }
    public string before { get; set; }
    public string after { get; set; }
    public string normalizedNER { get; set; }
}

public class Sentence
{
    public int index { get; set; }
    public string parse { get; set; }
    public List<BasicDependency> basicDependencies { get; set; }
    public List<EnhancedDependency> enhancedDependencies { get; set; }
    public List<EnhancedPlusPlusDependency> enhancedPlusPlusDependencies { get; set; }
    public List<Token> tokens { get; set; }
}

public class RootObject
{
    public List<Sentence> sentences { get; set; }
}
*Note: Unfortunately this technique did not work out for coref annotations. The JSON did not convert properly to a class. I'm working on that now. This model was built from output using the annotators "tokenize, ssplit, pos, lemma, ner, parse".
My code, only slightly changed from the sample code, looks like this (note the "pipeline.jsonPrint"):
public static string LanguageAnalysis(string sourceText)
{
    string json = "";

    // Path to the folder with models extracted from stanford-corenlp-3.7.0-models.jar
    var jarRoot = @"..\..\..\..\packages\Stanford.NLP.CoreNLP.3.7.0.1\";

    // Annotation pipeline configuration
    var props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse");
    props.setProperty("ner.useSUTime", "0");

    // Change the current directory so StanfordCoreNLP can find all the model files automatically
    var curDir = Environment.CurrentDirectory;
    Directory.SetCurrentDirectory(jarRoot);
    var pipeline = new StanfordCoreNLP(props);
    Directory.SetCurrentDirectory(curDir);

    // Annotation
    var annotation = new Annotation(sourceText);
    pipeline.annotate(annotation);

    // Result - JSON print
    using (var stream = new ByteArrayOutputStream())
    {
        pipeline.jsonPrint(annotation, new PrintWriter(stream));
        json = stream.toString();
        stream.close();
    }

    return json;
}
It seems to deserialize nicely with code like this:
using Newtonsoft.Json;
string sourceText = "My text document to parse.";
string json = Analysis.LanguageAnalysis(sourceText);
RootObject document = JsonConvert.DeserializeObject<RootObject>(json);
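For anyone working through the question in Python, the same deserialize-the-JSON idea needs no generated classes: the standard json module turns CoreNLP's jsonPrint output straight into dicts and lists. A minimal sketch, using a hand-abbreviated sample shaped like the class model above (the field names come from those classes; the sample itself is not real CoreNLP output):

```python
import json

# Abbreviated stand-in for CoreNLP jsonPrint output.
sample = '''
{"sentences": [
  {"index": 0,
   "parse": "(ROOT (S (NP (PRP He)) (VP (VBD did))))",
   "tokens": [
     {"index": 1, "word": "He",  "lemma": "he", "pos": "PRP"},
     {"index": 2, "word": "did", "lemma": "do", "pos": "VBD"}
   ]}
]}
'''

document = json.loads(sample)
for sentence in document["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["pos"])
```

Each sentence dict also carries the bracketed parse under the "parse" key, so both representations are available from the one JSON response.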
Upvotes: 1
Reputation: 3905
This looks like Lisp, so writing a Lisp program to traverse it and extract what you want would be easy.
You could also convert it into a nested list with pyparsing and process it in Python:
from pyparsing import OneOrMore, nestedExpr
nlpdata = '(ROOT (S (NP (PRP He)) (VP (VBD did) (RB n\'t) (VP (VB get) (NP (DT a) (NN reply)))) (. .)))'
data = OneOrMore(nestedExpr()).parseString(nlpdata)
print(data)
# [['ROOT', ['S', ['NP', ['PRP', 'He']], ['VP', ['VBD', 'did'], ['RB', "n't"], ['VP', ['VB', 'get'], ['NP', ['DT', 'a'], ['NN', 'reply']]]], ['.', '.']]]]
Note that I had to escape the apostrophe in "n't".
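Once you have the nested lists, a plain recursive walk covers the "find children and parents" part of the question; a sketch over the structure printed above (written against a literal copy of that list, so it runs without pyparsing):

```python
data = [['ROOT', ['S', ['NP', ['PRP', 'He']],
        ['VP', ['VBD', 'did'], ['RB', "n't"],
         ['VP', ['VB', 'get'], ['NP', ['DT', 'a'], ['NN', 'reply']]]],
        ['.', '.']]]]

def walk(node, parent=None):
    """Yield (tag, parent_tag) for every tagged node in the tree."""
    tag, children = node[0], node[1:]
    yield tag, parent
    for child in children:
        if isinstance(child, list):      # sublists are tagged nodes
            yield from walk(child, parent=tag)
        # bare strings are the words themselves; skip them here

for tag, parent in walk(data[0]):
    print(tag, '<-', parent)
```

The same walk is easy to extend to collect the words too, or to build the tag objects with child lists that the question describes.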
Upvotes: 1