Reputation: 493
I'm currently working on a project involving sentence vectors (from a pretrained RoBERTa model). These vectors are of lower quality when sentences are long, and my corpus contains many long sentences with subclauses.
I've been looking for methods for clause extraction / long-sentence segmentation, but I was surprised to find that none of the major NLP packages (e.g., spaCy or Stanza) offer this out of the box.
I suppose this could be done with spaCy's or Stanza's dependency parsing, but handling all kinds of convoluted sentences and edge cases properly would probably get quite complicated.
I've come across this implementation of the ClausIE information extraction system on top of spaCy that does something similar, but it hasn't been updated and doesn't work on my machine.
I've also come across this repo for sentence simplification, but I get an annotation error from Stanford CoreNLP when I run it locally.
Is there any obvious package/method that I've overlooked? If not, is there a simple way to implement this with spaCy or Stanza?
Upvotes: 9
Views: 7493
Reputation: 15623
Here is code that works on your specific example. Expanding this to cover all cases is not simple, but can be approached over time on an as-needed basis.
import spacy
import deplacy

en = spacy.load('en_core_web_sm')

text = "This all encompassing experience wore off for a moment and in that moment, my awareness came gasping to the surface of the hallucination and I was able to consider momentarily that I had killed myself by taking an outrageous dose of an online drug and this was the most pathetic death experience of all time."

doc = en(text)
#deplacy.render(doc)

seen = set()  # keep track of covered words
chunks = []
for sent in doc.sents:
    # coordinated clauses hang off the sentence root with the dep label 'conj'
    heads = [cc for cc in sent.root.children if cc.dep_ == 'conj']

    for head in heads:
        # the subtree of each conjunct head is one clause
        words = [ww for ww in head.subtree]
        for word in words:
            seen.add(word)
        chunk = ' '.join([ww.text for ww in words])
        chunks.append((head.i, chunk))

    # whatever wasn't claimed by a conjunct belongs to the root's clause
    unseen = [ww for ww in sent if ww not in seen]
    chunk = ' '.join([ww.text for ww in unseen])
    chunks.append((sent.root.i, chunk))

# restore the original clause order using the head token's index
chunks = sorted(chunks, key=lambda x: x[0])

for ii, chunk in chunks:
    print(chunk)
deplacy is optional, but I find it useful for visualizing dependencies (uncomment the render call above to see the dependency structure).
Also, I see you express surprise that this is not a built-in feature of common NLP libraries. The reason is simple: most applications don't need this, and while it seems like a simple task, it ends up being really complicated and application-specific the more cases you try to cover. On the other hand, for any specific application, like the example above, it's relatively easy to hack together a good-enough solution.
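As one example of that kind of as-needed extension, here's a sketch (my own refactoring, not a complete solution): subordinate clauses such as adverbial clauses attach to the root with the dep label 'advcl' rather than 'conj', so you can split those off too by widening the label check. The function name split_clauses and the label set SPLIT_DEPS are my own choices.

import spacy

en = spacy.load('en_core_web_sm')

# dependency labels to split on; extend as new cases show up in your corpus
SPLIT_DEPS = {'conj', 'advcl'}

def split_clauses(text):
    chunks = []
    for sent in en(text).sents:
        seen = set()
        for head in [cc for cc in sent.root.children if cc.dep_ in SPLIT_DEPS]:
            # each split-off subtree becomes its own clause
            words = list(head.subtree)
            seen.update(words)
            chunks.append((head.i, ' '.join(ww.text for ww in words)))
        # everything not claimed by a split-off subtree stays with the root
        rest = [ww for ww in sent if ww not in seen]
        chunks.append((sent.root.i, ' '.join(ww.text for ww in rest)))
    return [chunk for ii, chunk in sorted(chunks)]

for clause in split_clauses("Although it was raining, we walked to the station and we got soaked."):
    print(clause)

As with the snippet above, the splits are only as good as the parse, and leftover conjunctions and punctuation stay attached to the root's chunk, so it's worth checking the labels with deplacy on your own data before adding more of them.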
Upvotes: 8