Reputation: 1227
I had the following text
text = 'Shop 1 942.10 984.50 1023.90 1064.80 \n\nShop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\nShop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\nShop 3 1059.40 1107.10 1151.40 1197.40 \n\nShop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\nShop 4 after 3 months 1082.40 1131.10 1176.40 1223.40'
which I cleaned up by replacing every '\n\n' with '. ' using this code:
text = text.replace('\n\n', '. ')
I constructed a matcher with a simple and generic pattern like this:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg', disable=['ner'])
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)
I then used the matcher to find all the sentences, which are separated in the text by the full stop '.':
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')
I expected to obtain these results:
Shop 1
Shop 1 942.10 984.50 1023.90 1064.80 .
Shop 2
Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .
Shop 2
Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .
Shop 3
Shop 3 1059.40 1107.10 1151.40 1197.40 .
Shop 4
Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .
Shop 4
Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40
However, spaCy did not split the sentences at the full stops but according to some internal rules that are opaque to me, so my code returned the following results:
Shop 1
Shop 1 942.10
Shop 2
Shop 2 first 12 months
Shop 2
Shop 2 after 12 months 1045.50 1092.60
Shop 3
Shop 3
Shop 4
Shop 4 first 3 months
Shop 4
Shop 4 after 3 months
Is there a way to instruct or override how spaCy recognizes sentences in a text based on a specific pattern (the full stop '.' in this case)?
Upvotes: 1
Views: 2246
Reputation: 4548
What you probably want to do is define a custom sentence segmenter. By default, spaCy's sentence segmentation uses the dependency parse to figure out where sentences begin and end. You can override this by writing your own function that sets sentence boundaries and adding it to the NLP pipeline. Following the example in spaCy's documentation:
import spacy
def custom_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the parser not to start a sentence here
    return doc
nlp = spacy.load('en_core_web_lg', disable=['ner'])
nlp.add_pipe(custom_sentencizer, before="parser")  # Insert before the parser can build its own sentences

# text = ...
doc = nlp(text)
matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')
# Shop 1
# Shop 1 942.10 984.50 1023.90 1064.80 .
#
# Shop 2
# Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .
#
# Shop 2
# Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .
#
# Shop 3
# Shop 3 1059.40 1107.10 1151.40 1197.40 .
#
# Shop 4
# Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .
#
# Shop 4
# Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40
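As an aside, newer spaCy releases (2.2+, if I recall correctly) ship a rule-based Sentencizer whose punct_chars argument does the same job without a hand-written function. Here is a minimal sketch of that variant (nlp3, doc3, and matcher3 are just illustrative names); since span.sent only needs the boundary flags, not a full parse, the parser can be disabled outright:

import spacy
from spacy.matcher import Matcher
from spacy.pipeline import Sentencizer

# Assumption: spaCy 2.2+, where Sentencizer accepts punct_chars.
nlp3 = spacy.load('en_core_web_lg', disable=['ner', 'parser'])  # no parser needed
nlp3.add_pipe(Sentencizer(punct_chars=['.']))  # split only on '.' tokens

doc3 = nlp3(text)  # the original '. '-separated text from the question
matcher3 = Matcher(nlp3.vocab)
matcher3.add('REV', None, [{'ORTH': 'Shop'}, {'LIKE_NUM': True}])

for match_id, start, end in matcher3(doc3):
    matched_span = doc3[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

The decimal points inside values like 942.10 don't trigger a split here, because the tokenizer keeps such numbers as single tokens and only standalone '.' tokens count.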
Your text is very different from natural language, so it's no wonder spaCy doesn't do a great job: its models are trained on examples that look like text you would read in a book or on the Internet, while your example looks more like a machine-readable list of numbers. For instance, if the text were written out as prose, it might look something like this:
Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80. Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20. Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70. Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40. Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30. After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.
Using this as the input gives spaCy's default parser a much better chance at figuring out where the sentence breaks are, even with all those other punctuation marks:
text2 = "Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80. Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20. Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70. Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40. Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30. After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40."
nlp2 = spacy.load('en_core_web_lg', disable=['ner']) # default sentencizer
doc2 = nlp2(text2)
matches2 = matcher(doc2) # same matcher
for match_id, start, end in matches2:
    matched_span = doc2[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')
# Shop 1
# Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80.
#
# Shop 2
# Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20.
#
# Shop 2
# Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70.
#
# Shop 3
# Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40.
#
# Shop 4
# Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30.
#
# Shop 4
# After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.
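If you want to sanity-check where the default parser put the boundaries, independent of the matcher, you can iterate over the sentences directly:

# Print each sentence the default parser found in the prose version.
for sent in doc2.sents:
    print(sent.text)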
Note that this is not foolproof: the default parser will still mess up if the sentence structure gets too complex or fancy. NLP in general, and spaCy in particular, is not about parsing a small data set and extracting particular values exactly right every time; it's more about parsing gigabytes of documents quickly and doing a good enough job, in a statistical sense, to support meaningful computations on the data.
Upvotes: 1