Reputation: 493
I'm currently working on a project involving sentence vectors (from a pretrained RoBERTa model). These vectors are of lower quality when sentences are long, and my corpus contains many long sentences with subclauses.
I've been looking for methods for clause extraction / long-sentence segmentation, but I was surprised to find that none of the major NLP packages (e.g., spaCy or Stanza) offer this out of the box.
I suppose this could be done with spaCy's or Stanza's dependency parsing, but handling all kinds of convoluted sentences and edge cases properly would probably get quite complicated.
I've come across this implementation of the ClausIE information extraction system on top of spaCy that does something similar, but it hasn't been updated and doesn't work on my machine.
I've also come across this repo for sentence simplification, but I get an annotation error from Stanford CoreNLP when I run it locally.
Is there any obvious package/method that I've overlooked? If not, is there a simple way to implement this with spaCy or Stanza?
Upvotes: 9
Views: 7493
Reputation: 15623
Here is code that works on your specific example. Expanding this to cover all cases is not simple, but can be approached over time on an as-needed basis.
import spacy
import deplacy

en = spacy.load('en_core_web_sm')

text = "This all encompassing experience wore off for a moment and in that moment, my awareness came gasping to the surface of the hallucination and I was able to consider momentarily that I had killed myself by taking an outrageous dose of an online drug and this was the most pathetic death experience of all time."

doc = en(text)
#deplacy.render(doc)

seen = set()  # keep track of covered words
chunks = []
for sent in doc.sents:
    # coordinated clauses hang off the sentence root with the dep label 'conj'
    heads = [cc for cc in sent.root.children if cc.dep_ == 'conj']

    for head in heads:
        # the subtree of each conjunct head is one clause
        words = [ww for ww in head.subtree]
        for word in words:
            seen.add(word)
        chunk = ' '.join([ww.text for ww in words])
        chunks.append((head.i, chunk))

    # whatever wasn't claimed by a conjunct belongs to the root's clause
    unseen = [ww for ww in sent if ww not in seen]
    chunk = ' '.join([ww.text for ww in unseen])
    chunks.append((sent.root.i, chunk))

# restore the original clause order using the head token's index
chunks = sorted(chunks, key=lambda x: x[0])

for ii, chunk in chunks:
    print(chunk)
deplacy is optional, but I find it useful for visualizing dependencies (uncomment the render call above to see the dependency structure).
Also, I see you express surprise that this is not a built-in feature of common NLP libraries. The reason is simple: most applications don't need this, and while it seems like a simple task, it ends up being really complicated and application-specific the more cases you try to cover. On the other hand, for any specific application, like the example above, it's relatively easy to hack together a good-enough solution.
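As one example of that kind of as-needed extension, here's a sketch (my own refactoring, not a complete solution): subordinate clauses such as adverbial clauses attach to the root with the dep label 'advcl' rather than 'conj', so you can split those off too by widening the label check. The function name split_clauses and the label set SPLIT_DEPS are my own choices.

import spacy

en = spacy.load('en_core_web_sm')

# dependency labels to split on; extend as new cases show up in your corpus
SPLIT_DEPS = {'conj', 'advcl'}

def split_clauses(text):
    chunks = []
    for sent in en(text).sents:
        seen = set()
        for head in [cc for cc in sent.root.children if cc.dep_ in SPLIT_DEPS]:
            # each split-off subtree becomes its own clause
            words = list(head.subtree)
            seen.update(words)
            chunks.append((head.i, ' '.join(ww.text for ww in words)))
        # everything not claimed by a split-off subtree stays with the root
        rest = [ww for ww in sent if ww not in seen]
        chunks.append((sent.root.i, ' '.join(ww.text for ww in rest)))
    return [chunk for ii, chunk in sorted(chunks)]

for clause in split_clauses("Although it was raining, we walked to the station and we got soaked."):
    print(clause)

As with the snippet above, the splits are only as good as the parse, and leftover conjunctions and punctuation stay attached to the root's chunk, so it's worth checking the labels with deplacy on your own data before adding more of them.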
Upvotes: 8