user11749375

Reputation:

NLP problem: handling sentences with conjunctions

What I would like to do

I would like to preprocess sentences that include conjunctions, like the ones below. I don't care about the tense of the verb or how it changes to agree with the subject. What I want is to end up with two new sentences, each with its own subject and verb.

**Pattern1**
They entered the house and she glanced at the dark fireplace.
["They entered the house ", "she glanced at the dark fireplace"]

**Pattern2** 
Felipa and Alondra sing a song.
["Felipa sing a song”, "Alondra sing a song"]

**Pattern3**
Jessica watches TV and eats dinner.
["Jessica watch TV", "Jessica eat dinner"]

Problem

I was able to handle Pattern 1 with the first code block below, but I'm stuck on how to deal with Patterns 2 and 3 in the second code block.

Using the NLP library spaCy, I was able to confirm that the conjunction is recognized as CCONJ. However, that alone gives me no clue how to achieve what I described above.

Please give me your advice!

Current Code

Pattern1

text = "They entered the house and she glanced at the dark fireplace."
if 'and' in text:
    text = text.replace('and',',')
    l = [x.strip() for x in text.split(',') if not x.strip() == '']
l

#output
['They entered the house', 'she glanced at the dark fireplace.']

Pattern2 and Pattern3

text = "Felipa and Alondra sing a song."
doc_dep = nlp(text)
for k in range(len(doc_dep)):
    token = doc_dep[k]
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_) 
    if token.pos_ == 'CCONJ':
        print(token.text)

#output
Felipa felipa NOUN NN nsubj
     SPACE _SP 
and and CCONJ CC cc
and
     SPACE _SP 
Alondra Alondra PROPN NNP nsubj
sing sing VERB VBP ROOT
a a DET DT det
song song NOUN NN dobj
. . PUNCT . punct
text = "Jessica watches TV and eats dinner."
doc_dep = nlp(text)
for k in range(len(doc_dep)):
    token = doc_dep[k]
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_) 
    if token.pos_ == 'CCONJ':
        print(token.text)
#output
Jessica Jessica PROPN NNP nsubj
watches watch VERB VBZ ROOT
TV tv NOUN NN dobj
and and CCONJ CC cc
and
eats eat VERB VBZ conj
dinner dinner NOUN NN dobj
. . PUNCT . punct

Development Environment

Python 3.7.4

spaCy 2.3.1

Jupyter Notebook 6.0.3

Upvotes: 3

Views: 2081

Answers (2)

B89

Reputation: 41

Another way to solve this is to implement a custom sentence boundary detection (SBD) component. This component needs to be added to the pipeline before spaCy's parser.

Please take a look at this solution, which uses an SBD component to segment a sentence. You can also use a regex to find coordinating conjunctions like and, or, but.
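
For spaCy 2.x (the version in the question), a minimal sketch of such a component could look like the one below. The rule "start a new sentence right after every CCONJ token" is my own assumption for illustration, not a general-purpose solution, and the component has to be added before the parser because sentence starts cannot be changed after parsing.

import spacy

def split_on_cconj(doc):
    # Mark the token after each coordinating conjunction as a sentence start,
    # so doc.sents yields the clauses on either side of "and"/"or"/"but".
    for token in doc[:-1]:
        if token.pos_ == "CCONJ":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(split_on_cconj, name="split_on_cconj", before="parser")

doc = nlp("They entered the house and she glanced at the dark fireplace.")
print([sent.text for sent in doc.sents])

#output (roughly)
['They entered the house and', 'she glanced at the dark fireplace.']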

Upvotes: 0

Gabriel

Reputation: 587

There's no reason to think that the same code should be able to handle all of these situations, as the function of the word "and" is very different in each case. In Pattern 1, it is connecting two independent clauses. In Pattern 2, it is creating a compound subject. In Pattern 3, it is coordinating verb phrases.
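
To make that concrete, here is a small sketch of my own, using the same spaCy 2.x dependency parse the question already prints; the exact labels depend on the model, so treat the conditions as illustrative rather than definitive:

import spacy

nlp = spacy.load("en_core_web_sm")

examples = [
    "They entered the house and she glanced at the dark fireplace.",
    "Felipa and Alondra sing a song.",
    "Jessica watches TV and eats dinner.",
]

for text in examples:
    doc = nlp(text)
    for token in doc:
        if token.dep_ == "conj":
            # Does the conjoined element bring its own subject with it?
            own_subject = any(c.dep_ in ("nsubj", "nsubjpass") for c in token.children)
            print(text)
            print("  ", token.text, "is conjoined to", token.head.text,
                  "| head POS:", token.head.pos_, "| own subject:", own_subject)

A conj token whose head is a noun signals a compound subject (Pattern 2); a conj token whose head is a verb signals either a second independent clause, if it has its own nsubj child (Pattern 1), or a coordinated verb phrase, if it does not (Pattern 3).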

I would caution you that if your ultimate aim is to 'split' all sentences that contain the word 'and' (or any other coordinating conjunction) in this way, you have a very challenging job ahead of you. Coordinating conjunctions function in many different ways in English, and there are many common patterns beyond the ones you list here. One is nonconstituent coordination: "Bill went to Chicago on Wednesday and New York on Thursday", which you'd presumably want to turn into ["Bill went to Chicago on Wednesday", "Bill went to New York on Thursday"] -- note the subtle but critical difference from "Bill went to Chicago and New York on Thursday", which would need to become ["Bill went to Chicago on Thursday", "Bill went to New York on Thursday"]. Another is coordination of verbs ("Mary saw and heard him walk up the steps"). And of course more than two constituents can be coordinated ("Sarah, John, and Marcia..."), and these patterns can all be combined in the same sentence.

English is complicated and handling this would be a huge job, even for a linguist with a strong command of what is going on syntactically in all the cases to be covered. Even just characterizing how English coordinations behave is tough, as illustrated by this paper, which considers just a handful of patterns. If you consider that your code would have to handle real-world sentences with multiple 'and's doing different things (e.g., "Autonomous cars shift insurance liability and moral responsibility toward manufacturers, and it doesn't look like this will change anytime soon"), the complexity of the task becomes clearer.

That said, if you are only interested in handling the most common and simple cases, you might be able to make at least some headway by processing the results of a constituency parser, such as the one built into NLTK or a spaCy plugin like benepar. That would at least clearly show you which elements of the sentence are being coordinated by the conjunction.
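
As a rough illustration (assuming benepar 0.1.x, whose spaCy 2 plugin and benepar_en2 model need to be installed and downloaded separately), the constituency tree makes the coordinated phrase explicit:

import benepar
import spacy
from benepar.spacy_plugin import BeneparComponent

benepar.download("benepar_en2")  # one-time model download

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(BeneparComponent("benepar_en2"))

doc = nlp("Felipa and Alondra sing a song.")
sent = list(doc.sents)[0]
print(sent._.parse_string)

#output (roughly)
(S (NP (NNP Felipa) (CC and) (NNP Alondra)) (VP (VBP sing) (NP (DT a) (NN song))) (. .))

Here the coordinated NP "(NP (NNP Felipa) (CC and) (NNP Alondra))" is a single constituent, which is exactly the information you would need in order to duplicate the subject for Pattern 2.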

I don't know what your ultimate task is so I can't say this with confidence, but I'm skeptical that the gains you get by preprocessing in this way will be worth the effort. You might consider stepping back and thinking about the ultimate task you are trying to achieve, and researching (and/or asking StackOverflow) whether there are any preprocessing steps that are known to generally improve performance.

Upvotes: 1
