Reputation: 1863
I'm trying to build a sentence simplification algorithm based on Stanford CoreNLP. One of the simplifications I want to do is to transform a sentence with homogeneous parts into several sentences. E.g.
I love my mom, dad and sister. -> I love my mom. I love my dad. I love my sister.
First of all, I build the semantic graph for the input sentence string:
final Sentence parsed = new Sentence(sentence);
final SemanticGraph dependencies = parsed.dependencyGraph();
The dependency graph for this sentence is
-> love/VBP (root)
  -> I/PRP (nsubj)
  -> mom/NN (dobj)
    -> my/PRP$ (nmod:poss)
    -> ,/, (punct)
    -> dad/NN (conj:and)
    -> and/CC (cc)
    -> sister/NN (conj:and)
  -> dad/NN (dobj)
  -> sister/NN (dobj)
Then I find the dobj edges and the nsubj edge in the graph:
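// Collected dobj edges and the single nsubj edge of the sentence.
final List<SemanticGraphEdge> modifiers = new ArrayList<>();
SemanticGraphEdge subj = null;
for (SemanticGraphEdge edge : dependencies.edgeListSorted()) {
    if (edge.getRelation().getShortName().startsWith("dobj")) {
        modifiers.add(edge);
    } else if (edge.getRelation().getShortName().startsWith("nsubj")) {
        subj = edge;
    }
}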
So now I have 3 edges in modifiers and an nsubj edge whose dependent is the word I. My problem now is how to split the semantic graph into 3 separate graphs.
Of course, the naive solution is just to build each sentence from subj and the governor/dependent of each dobj edge, but I understand that it's a bad idea and won't work on more complicated examples.
for (final SemanticGraphEdge edge : modifiers) {
    SemanticGraph semanticGraph = dependencies.makeSoftCopy();
    final IndexedWord governor = edge.getGovernor();
    final IndexedWord dependent = edge.getDependent();
    final String governorTag = governor.backingLabel().tag().toLowerCase();
    if (governorTag.startsWith("vb")) {
        StringBuilder b = new StringBuilder(subj.getDependent().word());
        b.append(" ")
            .append(governor.word())
            .append(" ")
            .append(dependent.word())
            .append(". ");
        System.out.println(b);
    }
}
Can anyone give me some advice? Maybe I missed something useful in the CoreNLP documentation? Thanks.
Upvotes: 1
Views: 204
Reputation: 1863
Thanks to @JosepValls for the great idea. Here are some code samples showing how I simplify sentences with 3 or more homogeneous words.
First of all, I defined several regexps for cases such as
jj(optional) nn, jj(optional) nn, jj(optional) nn and jj(optional) nn
jj(optional) nn, jj(optional) nn, jj(optional) nn , jj(optional) nn ...
jj , jj , jj
jj , jj and jj
vb nn(optional) , vb nn(optional) , vb nn(optional)
and so on
The regexps are
Pattern nounAdjPattern = Pattern.compile("(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");
Pattern verbPattern = Pattern.compile("((vb\\snn|vb)\\s((cc)|,)\\s){2,}((vb\\snn)|vb)");
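As a rough illustration of how such a pattern locates the homogeneous part, the sketch below runs the noun/adjective regexp against a hand-written POS string for "I love my mom, dad and sister." The class name PatternBoundaries and the helper homogeneousSpan are made up for this example, and the tag string is typed in by hand rather than produced by the tagger.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternBoundaries {
    // Same regexp as nounAdjPattern in the answer.
    static final Pattern NOUN_ADJ = Pattern.compile(
        "(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");

    // Hypothetical helper: returns the matched POS span, or null when the
    // string contains no list of three or more homogeneous items.
    static String homogeneousSpan(String posString) {
        Matcher m = NOUN_ADJ.matcher(posString);
        return m.find() ? posString.substring(m.start(), m.end()) : null;
    }

    public static void main(String[] args) {
        // Assumed two-letter tags for: I love my mom , dad and sister .
        System.out.println(homogeneousSpan("pr vb pr nn , nn cc nn ."));
        // Only two items ("mom and dad"), so the {2,} quantifier rejects it.
        System.out.println(homogeneousSpan("pr vb pr nn cc nn ."));
    }
}
```

The values of m.start() and m.end() are what later lets the code count how many words sit before and after the list.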
These patterns are used to determine whether the input sentence contains a list of homogeneous words and to find its boundaries. After that, I create a list of POS tags based on the words of the original sentence:
final Sentence parsed = new Sentence(sentence);
final List<String> words = parsed.words();
List<String> pos = parsed.posTags().stream()
        .map(tag -> tag.length() < 2 ? tag.toLowerCase() : tag.substring(0, 2).toLowerCase())
        .collect(Collectors.toList());
To match this POS structure against the regexps, I join the list into a single string:
String posString = pos.stream().collect(Collectors.joining(" "));
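To make the tag normalization concrete, here is a small stdlib-only sketch that applies the same mapping to a hard-coded tag list. The tags are the ones shown in the dependency graph of the question; the class name PosString and the method toPosString are assumptions for this example, and a real run would take the tags from parsed.posTags().

```java
import java.util.List;
import java.util.stream.Collectors;

class PosString {
    // Keep at most the first two characters of each tag, lower-cased,
    // then join everything with single spaces.
    static String toPosString(List<String> tags) {
        return tags.stream()
            .map(tag -> tag.length() < 2 ? tag.toLowerCase() : tag.substring(0, 2).toLowerCase())
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Tags for: I love my mom , dad and sister .
        List<String> tags = List.of("PRP", "VBP", "PRP$", "NN", ",", "NN", "CC", "NN", ".");
        System.out.println(toPosString(tags)); // prints "pr vb pr nn , nn cc nn ."
    }
}
```

Note that truncating to two characters collapses related tags, e.g. PRP and PRP$ both become pr, and every verb form (VB, VBP, VBZ, ...) becomes vb. That is what keeps the regexps short.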
If the sentence doesn't match any regex, I return the same string; otherwise I simplify it.
if (!matcher.find()) {
    return new SimplificationResult(Collections.singleton(sentence));
}
return new SimplificationResult(simplify(posString, matcher, words));
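The snippet above does not show where matcher comes from. One possible way, sketched here purely as an assumption, is to try each pattern in turn and keep the first matcher that finds something. FirstMatch, PATTERNS and firstMatch are hypothetical names that do not appear in the original code.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstMatch {
    // The two regexps from the answer, tried in this order.
    static final List<Pattern> PATTERNS = List.of(
        Pattern.compile("(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))"),
        Pattern.compile("((vb\\snn|vb)\\s((cc)|,)\\s){2,}((vb\\snn)|vb)"));

    // Returns the first matcher whose find() succeeded. The returned matcher
    // keeps its match state, so start() and end() can be read by the caller.
    static Optional<Matcher> firstMatch(String posString, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            Matcher m = p.matcher(posString);
            if (m.find()) {
                return Optional.of(m);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed tags for: I love and kiss and adore my mom .
        firstMatch("pr vb cc vb cc vb pr nn .", PATTERNS)
            .ifPresent(m -> System.out.println(m.start() + ".." + m.end()));
    }
}
```

With this shape, an empty Optional corresponds to the "return the same string" branch, and a present one is passed on to simplify.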
In the simplify method I look for the boundaries of the homogeneous part and extract 3 parts from the word list: the beginning and the ending, which don't change, and the homogeneous part, which is split into pieces. After splitting the homogeneous part, I build several simplified sentences of the form beginning + piece + ending.
private Set<String> simplify(String posString, Matcher matcher, List<String> words) {
    // POS tags before and after the matched homogeneous part.
    String startPOS = posString.substring(0, matcher.start());
    String endPOS = posString.substring(matcher.end());
    int wordsBeforeCnt = StringUtils.isEmpty(startPOS) ? 0 : startPOS.trim().split("\\s+").length;
    int wordsAfterCnt = StringUtils.isEmpty(endPOS) ? 0 : endPOS.trim().split("\\s+").length;
    String wordsBefore = words.subList(0, wordsBeforeCnt)
            .stream()
            .collect(Collectors.joining(" "));
    String wordsAfter = words.subList(words.size() - wordsAfterCnt, words.size())
            .stream()
            .collect(Collectors.joining(" "));
    List<String> homogeneousPart = words.subList(wordsBeforeCnt, words.size() - wordsAfterCnt);
    // Tokens that separate the items of the list.
    Set<String> splitWords = new HashSet<>(Arrays.asList(",", "and"));
    Set<String> simplifiedSentences = new HashSet<>();
    StringBuilder sb = new StringBuilder(wordsBefore);
    for (int i = 0; i < homogeneousPart.size(); i++) {
        String part = homogeneousPart.get(i);
        if (!splitWords.contains(part)) {
            sb.append(" ").append(part);
            // Last word of the list: close the final sentence.
            if (i == homogeneousPart.size() - 1) {
                sb.append(" ").append(wordsAfter).append(" ");
                simplifiedSentences.add(sb.toString());
            }
        } else {
            // Separator: close the current sentence and start a new one.
            sb.append(" ").append(wordsAfter).append(" ");
            simplifiedSentences.add(sb.toString());
            sb = new StringBuilder(wordsBefore);
        }
    }
    return simplifiedSentences;
}
So, for example, the sentence
I love and kiss and adore my beautiful mom, clever dad and sister.
will be simplified into 9 sentences if we use the 2 regexps above:
I adore my clever dad .
I love my clever dad .
I love my sister .
I kiss my sister .
I kiss my clever dad .
I adore my sister .
I love my beautiful mom .
I adore my beautiful mom .
I kiss my beautiful mom .
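The 9 results come from applying the patterns repeatedly: 3 noun phrases times 3 verbs. The sketch below is one way this could be organized, not the code from this answer. It carries a hand-assigned POS tag with every word so that split sentences can be re-checked without calling the tagger again. RepeatedSimplifier, Token, parse, splitOnce and simplifyFully are hypothetical names, the tags in main are assumptions, and records need Java 16 or later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RepeatedSimplifier {
    record Token(String word, String pos) {}

    static final List<Pattern> PATTERNS = List.of(
        Pattern.compile("(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))"),
        Pattern.compile("((vb\\snn|vb)\\s((cc)|,)\\s){2,}((vb\\snn)|vb)"));

    // Parses "word/pos word/pos ..." into tokens (illustration-only input format).
    static List<Token> parse(String tagged) {
        List<Token> tokens = new ArrayList<>();
        for (String item : tagged.split(" ")) {
            int slash = item.lastIndexOf('/');
            tokens.add(new Token(item.substring(0, slash), item.substring(slash + 1)));
        }
        return tokens;
    }

    static int countTokens(String s) {
        return s.trim().isEmpty() ? 0 : s.trim().split("\\s+").length;
    }

    static List<Token> join(List<Token> head, List<Token> piece, List<Token> tail) {
        List<Token> all = new ArrayList<>(head);
        all.addAll(piece);
        all.addAll(tail);
        return all;
    }

    // Splits on the first pattern that matches; returns an empty list if none does.
    static List<List<Token>> splitOnce(List<Token> sentence) {
        String posString = sentence.stream().map(Token::pos).collect(Collectors.joining(" "));
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(posString);
            if (!m.find()) {
                continue;
            }
            int before = countTokens(posString.substring(0, m.start()));
            int after = countTokens(posString.substring(m.end()));
            List<Token> head = sentence.subList(0, before);
            List<Token> tail = sentence.subList(sentence.size() - after, sentence.size());
            List<List<Token>> result = new ArrayList<>();
            List<Token> piece = new ArrayList<>();
            for (Token t : sentence.subList(before, sentence.size() - after)) {
                if (t.word().equals(",") || t.word().equals("and")) {
                    result.add(join(head, piece, tail)); // separator closes a piece
                    piece = new ArrayList<>();
                } else {
                    piece.add(t);
                }
            }
            result.add(join(head, piece, tail));
            return result;
        }
        return List.of();
    }

    // Keeps splitting until no pattern matches any remaining sentence.
    static Set<String> simplifyFully(List<Token> sentence) {
        Set<String> done = new LinkedHashSet<>();
        Deque<List<Token>> queue = new ArrayDeque<>();
        queue.add(sentence);
        while (!queue.isEmpty()) {
            List<Token> current = queue.poll();
            List<List<Token>> parts = splitOnce(current);
            if (parts.isEmpty()) {
                done.add(current.stream().map(Token::word).collect(Collectors.joining(" ")));
            } else {
                queue.addAll(parts);
            }
        }
        return done;
    }

    public static void main(String[] args) {
        List<Token> s = parse("I/pr love/vb and/cc kiss/vb and/cc adore/vb my/pr "
            + "beautiful/jj mom/nn ,/, clever/jj dad/nn and/cc sister/nn ./.");
        simplifyFully(s).forEach(System.out::println);
    }
}
```

Every split produces strictly shorter sentences, so the loop terminates. Carrying the tags along is a shortcut for the sketch; re-tagging each split sentence would be the more faithful option.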
This code works only with 3 or more homogeneous words, because for 2 words there are lots of exceptions. E.g.
Cat eats mouse, dog eats meat.
That sentence can't be simplified this way.
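To show why the threshold matters, the sketch below compares the original noun/adjective regexp with a hypothetical relaxed variant that would accept two items. The tag string for "Cat eats mouse, dog eats meat." is assumed by hand, and TwoItemPitfall, ITEM, SEP and TWO_OR_MORE are names invented for this example.

```java
import java.util.regex.Pattern;

class TwoItemPitfall {
    static final String ITEM = "((jj)\\s(nn)|(jj)|(nn))";
    static final String SEP = "\\s((cc)|,)\\s";

    // Same regexp as nounAdjPattern: at least two "item separator" repetitions,
    // i.e. three or more items.
    static final Pattern THREE_OR_MORE = Pattern.compile("(" + ITEM + SEP + "){2,}" + ITEM);

    // Hypothetical relaxed variant that would also accept two items.
    static final Pattern TWO_OR_MORE = Pattern.compile("(" + ITEM + SEP + "){1,}" + ITEM);

    static boolean matches(Pattern p, String posString) {
        return p.matcher(posString).find();
    }

    public static void main(String[] args) {
        // Assumed tags for: Cat eats mouse , dog eats meat .
        String pos = "nn vb nn , nn vb nn .";
        System.out.println(matches(THREE_OR_MORE, pos)); // false: left untouched
        System.out.println(matches(TWO_OR_MORE, pos));   // true: "mouse , dog" looks like a list
    }
}
```

With the relaxed variant, "mouse , dog" would be treated as a list of objects and the sentence would be split into nonsense such as "Cat eats mouse eats meat .", so two-item cases need separate handling.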
Upvotes: 1