With spaCy, how do I indicate that part of a fixed pattern can be separated by one or more words from the last part of the pattern?

Question

I am using the spaCy Matcher to extract negative sentences in French that contain a specific pattern. The key word is the connector "mais", but the negation can be at the beginning or the end of the sentence. My problem is that I do not know in advance how many words separate the negative pattern from the connector. Is there a way to tell spaCy that anywhere from zero to an unlimited number of words can separate the negative pattern from the connector?

Example:

"je ne crois pas mais ce n'est pas ça" => "je ne crois pas" directly precedes "mais"
"je ne crois pas ce qu'il a fait mais ce n'est pas ça" => in this case several other words separate the negation from "mais"

This is how I do it:

file = ['Les hommes et les femmes ne sont pas pareils, mais on n\'y comprend rien.',
'Les hommes et les femmes ne sont pas pareils comme d\'autres, mais on n\'y comprend rien.',
'L\'auteur décrit dans le détail (à son avis) ce qui ne faut pas faire mais ne donne aucun conseil, ce qui est de sa part une preuve de modestie puisqu\'il n\'a jamais pu être publié !', 
'"""Il fait de la variété ce qui n\'est pas déchoir, mais il ne joue pas """"classique"""" , il est à la musique classique ce que le reader\'s digest est à la littérature"""',
'Si vous n\'aimez pas Bach mais la peinture Rippolin, ce disque est pour vous.',
'Non, tout ça ne vaut pas en effet, mais pas du tout Le Petit Prince ou Livingstone le Goéland.',
'j\'aime bien et je ne crois pas mais ce n\'est pas ça',
'je ne comprends rien ça va mais pourquoi il fait ça , je n\'y comprends rien']



pattern0 = [{"POS": "ADV"}, 
          {"POS": {"IN": ["AUX","VERB"]}},
           {"LEMMA": {"IN": ["pas", "plus", "aucun", "rien"]}},

            {"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "?"},
            {"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "?"},
            {"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "?"},
            {"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "?"},
            {"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "?"},
            
           {"IS_PUNCT": True, "OP": "*"},  {"LOWER": "mais"}
            ]

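import spacy
from spacy.matcher import Matcher

# Assumption: the post does not name the model; any French pipeline with a tagger and lemmatizer should do
nlp = spacy.load("fr_core_news_sm")

matcher = Matcher(nlp.vocab)
# spaCy v3 signature; in spaCy v2 this was matcher.add("matching_0", None, pattern0)
matcher.add("matching_0", [pattern0])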

sent_extract2=[] #list of extracted sentences with last attribute IN THE lexique
sent_not_extract2=[] #list of extracted sentences with last attribute NOT IN THE lexique

for sent in file:
    doc=nlp(sent)
    matches= matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end].lemma_.split()
        print(sent)
        print("found match:", span)
        
sent_extract2 = set(sent_extract2)
# Displays the list of extracted sentences
print('\nList of extracted sentences pattern1')
print(len(sent_extract2))
print('\nOther sentences')
print(len(set(sent_not_extract2)))

Results:

Les hommes et les femmes ne sont pas pareils, mais on n'y comprend rien.
found match: ['ne', 'être', 'pas', 'pareil', ',', 'mais']
L'auteur décrit dans le détail (à son avis) ce qui ne faut pas faire mais ne donne aucun conseil, ce qui est de sa part une preuve de modestie puisqu'il n'a jamais pu être publié !
found match: ['ne', 'falloir', 'pas', 'faire', 'mais']
"""Il fait de la variété ce qui n'est pas déchoir, mais il ne joue pas """"classique"""" , il est à la musique classique ce que le reader's digest est à la littérature"""
found match: ["n'", 'être', 'pas', 'déchoir', ',', 'mais']
Si vous n'aimez pas Bach mais la peinture Rippolin, ce disque est pour vous.
found match: ["n'", 'aimer', 'pas', 'Bach', 'mais']
je ne comprends rien ça va mais pourquoi il fait ça , je n'y comprends rien
found match: ['ne', 'comprendre', 'rien', 'cela', 'aller', 'mais']

This solution seems to work, but it is a little clumsy because I have to keep adding the same line for every extra word that might appear. How can I rewrite the line " {"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "?"}," so that the matcher accepts it zero or more times without duplicating it?

I also tried "{}", but it matches exactly one token, not zero or more.

Natalia · Accepted Answer

You need to replace the five lines of

{"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "?"},

with one line:

{"POS": {"IN": ["ADJ", "AUX", "VERB", "NOUN", "ADV","PRON", "PROPN"]}, "OP": "*"},

As in regex, spaCy's 'OP' can be '?' (zero or one), '*' (zero or more) or '+' (one or more). There is also '!', which negates the token pattern (it must match zero times).
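
To see the '*' operator bridge a gap of unknown length, here is a minimal sketch rather than the asker's exact setup. It assumes spaCy v3 and uses a blank French pipeline (tokenizer only), so the pattern relies on "LOWER" instead of "POS" or "LEMMA", and the gap is a bare wildcard token {"OP": "*"}. The names "NEG_MAIS", "sentences" and "results" are made up for the illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only French pipeline, so no model download is required.
nlp = spacy.blank("fr")

pattern = [
    {"LOWER": {"IN": ["ne", "n'"]}},                      # negation particle
    {},                                                   # exactly one token (the verb)
    {"LOWER": {"IN": ["pas", "plus", "aucun", "rien"]}},  # second part of the negation
    {"OP": "*"},                                          # zero or more tokens of any kind
    {"LOWER": "mais"},                                    # the connector
]

matcher = Matcher(nlp.vocab)
# greedy="LONGEST" (spaCy v3) keeps only the longest span among overlapping matches.
matcher.add("NEG_MAIS", [pattern], greedy="LONGEST")

sentences = [
    "je ne crois pas mais ce n'est pas ça",
    "je ne crois pas ce qu'il a fait mais ce n'est pas ça",
]

results = []
for text in sentences:
    doc = nlp(text)
    for match_id, start, end in matcher(doc):
        results.append(doc[start:end].text)
        print("found match:", doc[start:end].text)
```

With a full French model loaded, the wildcard line can be swapped for the POS-restricted version from the answer if punctuation or other token types should not count as filler.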
