Danila Zharenkov

Reputation: 1863

Stanford CoreNLP find homogeneous parts of sentence

I'm trying to build a sentence simplification algorithm based on Stanford CoreNLP. One of the simplifications I want to make is to transform a sentence with homogeneous parts into several sentences. E.g.

I love my mom, dad and sister. -> I love my mom. I love my dad. I love my sister.

First of all, I build the semantic graph for the input sentence string:

    final Sentence parsed = new Sentence(sentence);
    final SemanticGraph dependencies = parsed.dependencyGraph();

The dependency graph for this sentence is

-> love/VBP (root)
  -> I/PRP (nsubj)
  -> mom/NN (dobj)
    -> my/PRP$ (nmod:poss)
    -> ,/, (punct)
    -> dad/NN (conj:and)
    -> and/CC (cc)
    -> sister/NN (conj:and)
  -> dad/NN (dobj)
  -> sister/NN (dobj)

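Then I find the dobj and nsubj edges in the graph:

    // Collected direct-object edges and the subject edge
    final List<SemanticGraphEdge> modifiers = new ArrayList<>();
    SemanticGraphEdge subj = null;
    for (SemanticGraphEdge edge : dependencies.edgeListSorted()) {
        if (edge.getRelation().getShortName().startsWith("dobj")) {
            modifiers.add(edge);
        } else if (edge.getRelation().getShortName().startsWith("nsubj")) {
            subj = edge;
        }
    }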

So now I have 3 edges in modifiers and an nsubj edge pointing to the word "I". My problem now is how to split the semantic graph into 3 separate graphs. Of course, the naive solution is just to build a sentence from subj and the governor/dependent of each dobj edge, but I understand that it's a bad idea and won't work on more complicated examples.

    for (final SemanticGraphEdge edge : modifiers) {
        SemanticGraph semanticGraph = dependencies.makeSoftCopy();
        final IndexedWord governor = edge.getGovernor();
        final IndexedWord dependent = edge.getDependent();

        final String governorTag = governor.backingLabel().tag().toLowerCase();
        if (governorTag.startsWith("vb")) {
            StringBuilder b = new StringBuilder(subj.getDependent().word());
            b.append(" ")
                    .append(governor.word())
                    .append(" ")
                    .append(dependent.word())
                    .append(". ");
            System.out.println(b);
        }
    }

Can anyone give me some advice? Maybe I missed something useful in the CoreNLP documentation? Thanks.

Upvotes: 1

Views: 204

Answers (1)

Danila Zharenkov

Reputation: 1863

Thanks to @JosepValls for the great idea. Here are some code samples showing how I simplify sentences with 3 or more homogeneous words.

First of all, I defined several regexps for cases such as

jj(optional) nn, jj(optional) nn, jj(optional) nn and jj(optional) nn
jj(optional) nn, jj(optional) nn, jj(optional) nn , jj(optional) nn ...
jj , jj , jj
jj , jj and jj
vb nn(optional) , vb nn(optional) , vb nn(optional)
and so on

The regexps are

Pattern nounAdjPattern = Pattern.compile("(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");
Pattern verbPattern = Pattern.compile("((vb\\snn|vb)\\s((cc)|,)\\s){2,}((vb\\snn)|vb)");

These patterns are used to determine whether the input sentence contains a list of homogeneous words and to find its boundaries. After that, I create a list of POS tags for the words of the original sentence:

final Sentence parsed = new Sentence(sentence);
final List<String> words = parsed.words();
List<String> pos = parsed.posTags().stream()
        .map(tag -> tag.length() < 2 ? tag.toLowerCase() : tag.substring(0, 2).toLowerCase())
        .collect(Collectors.toList());

To match this POS structure against the regexps, join the list into a string:

String posString = pos.stream().collect(Collectors.joining(" "));
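
To see what the matcher actually works with, here is a minimal sketch of the same normalization run on hard-coded tags, so it needs no CoreNLP models. The tag sequence is an assumption about what the tagger returns for "I love my mom, dad and sister."; the class name PosStringDemo and its method are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PosStringDemo {
    // Same noun/adjective pattern as in the answer.
    static final Pattern NOUN_ADJ = Pattern.compile(
            "(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");

    // Keep the first two characters of each tag, lower-cased, and join with spaces.
    static String toPosString(List<String> tags) {
        return tags.stream()
                .map(tag -> tag.length() < 2 ? tag.toLowerCase() : tag.substring(0, 2).toLowerCase())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Assumed tagger output for: I love my mom , dad and sister .
        List<String> tags = Arrays.asList("PRP", "VBP", "PRP$", "NN", ",", "NN", "CC", "NN", ".");
        String posString = toPosString(tags);
        System.out.println(posString);

        Matcher matcher = NOUN_ADJ.matcher(posString);
        if (matcher.find()) {
            // Character offsets of the homogeneous part inside the POS string
            System.out.println(matcher.group() + " [" + matcher.start() + ", " + matcher.end() + ")");
        }
    }
}
```

With these assumed tags the POS string is "pr vb pr nn , nn cc nn ." and the match is "nn , nn cc nn" at offsets 9 to 22. Those character offsets are what the simplify method later converts back into word positions.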

If the sentence doesn't match any regex, return the same string; otherwise, simplify it.

// Matcher for one of the patterns above, e.g. nounAdjPattern
Matcher matcher = nounAdjPattern.matcher(posString);
if (!matcher.find()) {
    return new SimplificationResult(Collections.singleton(sentence));
}
return new SimplificationResult(simplify(posString, matcher, words));

In the simplify method I look for the boundaries of the homogeneous part and extract three parts from the words list: the beginning and the ending, which don't change, and the homogeneous part, which is split into pieces. After splitting the homogeneous part, I build several simplified sentences of the form beginning + piece + ending.

private Set<String> simplify(String posString, Matcher matcher, List<String> words) {
    // POS fragments before and after the matched homogeneous part
    String startPOS = posString.substring(0, matcher.start());
    String endPOS = posString.substring(matcher.end());
    // Each POS token corresponds to one word, so token counts give word offsets
    int wordsBeforeCnt = StringUtils.isEmpty(startPOS) ? 0 : startPOS.trim().split("\\s+").length;
    int wordsAfterCnt = StringUtils.isEmpty(endPOS) ? 0 : endPOS.trim().split("\\s+").length;
    String wordsBefore = words.subList(0, wordsBeforeCnt)
            .stream()
            .collect(Collectors.joining(" "));
    String wordsAfter = words.subList(words.size() - wordsAfterCnt, words.size())
            .stream()
            .collect(Collectors.joining(" "));
    List<String> homogeneousPart = words.subList(wordsBeforeCnt, words.size() - wordsAfterCnt);
    // Words that separate the pieces of the homogeneous part
    Set<String> splitWords = new HashSet<>(Arrays.asList(",", "and"));
    Set<String> simplifiedSentences = new HashSet<>();
    StringBuilder sb = new StringBuilder(wordsBefore);
    for (int i = 0; i < homogeneousPart.size(); i++) {
        String part = homogeneousPart.get(i);
        if (!splitWords.contains(part)) {
            sb.append(" ").append(part);
            // The last piece has no trailing separator, so close it here
            if (i == homogeneousPart.size() - 1) {
                sb.append(" ").append(wordsAfter).append(" ");
                simplifiedSentences.add(sb.toString());
            }
        } else {
            // A separator ends the current piece: emit beginning + piece + ending
            sb.append(" ").append(wordsAfter).append(" ");
            simplifiedSentences.add(sb.toString());
            sb = new StringBuilder(wordsBefore);
        }
    }
    return simplifiedSentences;
}

So, e.g., the sentence

 I love and kiss and adore my beautiful mom, clever dad and sister.

will be simplified into 9 sentences if we use the 2 regexps above:

I adore my clever dad . 
I love my clever dad . 
I love my sister . 
I kiss my sister . 
I kiss my clever dad . 
I adore my sister . 
I love my beautiful mom . 
I adore my beautiful mom . 
I kiss my beautiful mom . 
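
The snippets above handle one matcher at a time, so getting all 9 sentences means applying both patterns repeatedly. The following is only a sketch of one way to do that, not necessarily how the original code chains them. It assumes the POS tags are already normalized and hard-coded (no CoreNLP call), it carries the tags along with the words so that no re-tagging is needed, and every name in it (ChainedSimplifyDemo, Tagged, splitOnce, glue) is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChainedSimplifyDemo {
    static final Pattern NOUN_ADJ = Pattern.compile(
            "(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");
    static final Pattern VERB = Pattern.compile(
            "((vb\\snn|vb)\\s((cc)|,)\\s){2,}((vb\\snn)|vb)");
    static final Set<String> SPLIT_WORDS = new HashSet<>(Arrays.asList(",", "and"));

    // A sentence as parallel lists: words.get(i) has the normalized tag tags.get(i).
    static final class Tagged {
        final List<String> words;
        final List<String> tags;
        Tagged(List<String> words, List<String> tags) { this.words = words; this.tags = tags; }
    }

    // Number of space-separated tokens in a fragment of the POS string.
    static int countTokens(String fragment) {
        String trimmed = fragment.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // items[0, from) + items[pieceStart, pieceEnd) + items[to, size): beginning + piece + ending.
    static List<String> glue(List<String> items, int from, int pieceStart, int pieceEnd, int to) {
        List<String> out = new ArrayList<>(items.subList(0, from));
        out.addAll(items.subList(pieceStart, pieceEnd));
        out.addAll(items.subList(to, items.size()));
        return out;
    }

    // Splits on the first match of the pattern; returns an empty list if nothing was split.
    static List<Tagged> splitOnce(Pattern pattern, Tagged s) {
        List<Tagged> result = new ArrayList<>();
        String posString = String.join(" ", s.tags);
        Matcher matcher = pattern.matcher(posString);
        if (!matcher.find()) {
            return result;
        }
        int from = countTokens(posString.substring(0, matcher.start()));
        int to = s.words.size() - countTokens(posString.substring(matcher.end()));
        int pieceStart = from;
        for (int i = from; i <= to; i++) {
            if (i == to || SPLIT_WORDS.contains(s.words.get(i))) {
                if (i > pieceStart) {
                    result.add(new Tagged(glue(s.words, from, pieceStart, i, to),
                                          glue(s.tags, from, pieceStart, i, to)));
                }
                pieceStart = i + 1;
            }
        }
        // A single piece means no split word was found; report "no split" to avoid looping.
        return result.size() > 1 ? result : new ArrayList<>();
    }

    // Keeps splitting until neither pattern produces a split.
    static Set<String> simplify(List<String> words, List<String> tags) {
        Set<String> done = new TreeSet<>();
        Deque<Tagged> todo = new ArrayDeque<>();
        todo.add(new Tagged(words, tags));
        while (!todo.isEmpty()) {
            Tagged current = todo.poll();
            List<Tagged> pieces = splitOnce(VERB, current);
            if (pieces.isEmpty()) {
                pieces = splitOnce(NOUN_ADJ, current);
            }
            if (pieces.isEmpty()) {
                done.add(String.join(" ", current.words));
            } else {
                todo.addAll(pieces);
            }
        }
        return done;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("I", "love", "and", "kiss", "and", "adore", "my",
                "beautiful", "mom", ",", "clever", "dad", "and", "sister", ".");
        // Assumed normalized tags for the words above.
        List<String> tags = Arrays.asList("pr", "vb", "cc", "vb", "cc", "vb", "pr",
                "jj", "nn", ",", "jj", "nn", "cc", "nn", ".");
        simplify(words, tags).forEach(System.out::println);
    }
}
```

With the assumed tags, the verb pattern first splits the sentence into three, and the noun/adjective pattern then splits each of those into three again, which gives the nine sentences listed above (without the trailing space, and sorted because of the TreeSet).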

This code works only with 3 or more homogeneous words, because for 2 words there are lots of exceptions. E.g.

Cat eats mouse, dog eats meat.

That sentence can't be simplified this way.
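
A small sketch of why the {2,} quantifier matters here. The POS string is an assumed tagging of "Cat eats mouse, dog eats meat.", and RELAXED is a hypothetical variant of the noun/adjective pattern that would also accept two items; it is not part of the code above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PairThresholdDemo {
    // Pattern from the answer: at least two "item + separator" repetitions, i.e. 3+ items.
    static final Pattern STRICT = Pattern.compile(
            "(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){2,}((jj)\\s(nn)|(jj)|(nn))");
    // Hypothetical relaxed variant that would also accept a list of two items.
    static final Pattern RELAXED = Pattern.compile(
            "(((jj)\\s(nn)|(jj)|(nn))\\s((cc)|,)\\s){1,}((jj)\\s(nn)|(jj)|(nn))");

    public static void main(String[] args) {
        // Assumed normalized tags for: Cat eats mouse , dog eats meat .
        String posString = "nn vb nn , nn vb nn .";

        System.out.println("strict matches: " + STRICT.matcher(posString).find());

        Matcher relaxed = RELAXED.matcher(posString);
        if (relaxed.find()) {
            // The matched tags belong to "mouse , dog", which is not a real list.
            System.out.println("relaxed matches: " + relaxed.group());
        }
    }
}
```

The strict pattern finds nothing, so the sentence is returned unchanged. The relaxed one matches "nn , nn", i.e. it would treat "mouse, dog" as a list of two nouns even though the comma separates two clauses.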

Upvotes: 1
