Reputation: 53
I would like to write a Python program that transforms a syntactically complex sentence into one or more simpler sentences that I can use in downstream tasks.
Complex sentence: "Tonight I'm going to play soccer with my friends before we're going to watch a movie in the theater next to the city center."
Simpler text with same content: "Tonight I'm going to play soccer with my friends. Then we're going to watch a movie in the theater. The theater is next to the city center."
My goal is that the final output consists only of simple main clauses (no coordinate clauses, no subordinate clauses, no relative clauses etc.). Essentially, each resulting clause should have one subject, one predicate, one direct object and potentially one indirect/prepositional object (each may have a modifier, e.g. an attributive adjective). If there are multiple of any of these, I don't mind ending up with repetitions ("I love mum and dad." --> "I love mum. I love dad.").
So far I have started implementing a relative clause resolver and a coordination resolver. They work reasonably well. However, there are many more cases to cover (causal, temporal, adversative subordinate clauses etc.), and I started to wonder whether someone out there has a better idea about how to tackle this.
Also, I rely heavily on spaCy, but I keep running into problems because modifying the document (which I do when I transform the text) goes against a core principle of spaCy. Hence again: maybe I should use a different approach altogether?
Thanks for any thoughts...
Upvotes: 2
Views: 1128
Reputation: 4349
polm23's answer is right that this is an active research area. As of June 2022, your best bet for an off-the-shelf model that you can run locally, or as close as research code will get to that, is to use GECToR for TST. The model and code are open source. You need to run them on a GPU, and it will likely take trial and error to get it going since the docs are minimal.
If you want to train a model that will likely beat this performance, you can follow TST's data augmentation approach to get around 1M training pairs, then fine-tune the best version of T5 that you can afford on the task, for example t5-efficient-base-nl36. If you have less compute, use a smaller T5, and if you have more, use T0_3B.
The easiest and likely best answer is "ask GPT-3 to do it." Using the prompt:
Simplify this sentence by breaking it into multiple sentences:
Original: Tonight I'm going to play soccer with my friends before we're going to watch a movie in the theater next to the city center.
Simplified: Tonight I'm going to play soccer with my friends. Then we're going to watch a movie in the theater. The theater is next to the city center.
---
Original: Essentially, the resulting clauses should end up having one subject, one predicate, one direct object and potentially one indirect/prepositional object (each may have a modifier such as e.g. an adjective attribute)
Simplified:
produces the output:
The resulting clauses should ideally have one subject, one predicate, one direct object, and potentially one indirect/prepositional object. Each object may have a modifier, such as an adjective attribute.
If you give it more than one example, it will probably do even better. (Edit: these results were using the model text-davinci-002)
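A minimal sketch of how a prompt with several examples could be assembled in the same format as above. `build_prompt` is a hypothetical helper written for this illustration, not part of any library, and the second example pair is taken from the question. Sending the prompt to a model is deliberately left out.

```python
def build_prompt(examples, sentence):
    """Assemble a few-shot simplification prompt.

    examples: list of (original, simplified) pairs used as demonstrations.
    sentence: the sentence the model should simplify.
    """
    header = "Simplify this sentence by breaking it into multiple sentences:"
    blocks = [f"Original: {orig}\nSimplified: {simp}" for orig, simp in examples]
    # The last block leaves "Simplified:" open for the model to complete.
    blocks.append(f"Original: {sentence}\nSimplified:")
    return header + "\n" + "\n---\n".join(blocks)


examples = [
    (
        "Tonight I'm going to play soccer with my friends before we're going "
        "to watch a movie in the theater next to the city center.",
        "Tonight I'm going to play soccer with my friends. Then we're going "
        "to watch a movie in the theater. The theater is next to the city center.",
    ),
    ("I love mum and dad.", "I love mum. I love dad."),
]

prompt = build_prompt(examples, "She stayed home because it was raining.")
print(prompt)
```

The `---` separator and the `Original:` / `Simplified:` labels simply mirror the prompt shown above; other delimiters may work just as well.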
Upvotes: 2
Reputation: 15593
What you are attempting to do is called "sentence simplification". It's an active research topic and there is no simple solution, not even a robust library you can use (as far as I'm aware). The best thing you can do is read research papers and implement them, look for released models, or do some good-enough processing with dependency parsing, I'm afraid. Look here for an overview of some research.
You're right that rewriting tasks are not something spaCy is designed for. However, if sentence-level alignment is enough, I think it shouldn't be hard to store your rewritten sentences as span extensions attached to each sentence.
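As a rough sketch of that idea, assuming spaCy v3: the attribute name `simplified` and the `simplify` function are placeholders invented for this example, and the blank pipeline with the rule-based `sentencizer` is only used so that no trained model has to be downloaded.

```python
import spacy
from spacy.tokens import Span

# Register a custom attribute on spans; "simplified" is a name chosen for this example.
# force=True avoids an error if the extension was already registered in this session.
Span.set_extension("simplified", default=None, force=True)

# Blank English pipeline plus rule-based sentence splitting (no trained model needed).
# In practice you would probably use a pipeline with a dependency parser instead.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def simplify(sentence_text):
    # Placeholder for your own rewriting logic (hypothetical); returns the input unchanged.
    return [sentence_text]


doc = nlp("I love mum and dad. We watch a movie in the theater.")

# Attach the rewritten sentences to each sentence span; the Doc itself is never modified.
for sent in doc.sents:
    sent._.simplified = simplify(sent.text)

for sent in doc.sents:
    print(sent.text, "->", sent._.simplified)
```

The original text and its token offsets stay intact, so the rewritten sentences can always be traced back to the sentence they came from.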
If you need to manipulate the dependency relations directly, I don't think there's anything that's as easy to use as spaCy.
Upvotes: 2