Reputation: 1227
I had the following text
text = 'Shop 1 942.10 984.50 1023.90 1064.80 \n\nShop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 \n\nShop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 \n\nShop 3 1059.40 1107.10 1151.40 1197.40 \n\nShop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 \n\nShop 4 after 3 months 1082.40 1131.10 1176.40 1223.40'
which I cleaned up by replacing every '\n\n' with '. ' using this code:
text = text.replace('\n\n', '. ')
I constructed a matcher with a simple and generic pattern like this:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg', disable=['ner'])
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)
I then used the matcher to find all the sentences, which are separated in the text by the full stop '.':
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')
I expected to obtain these results:
Shop 1
Shop 1 942.10 984.50 1023.90 1064.80 .
Shop 2
Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .
Shop 2
Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .
Shop 3
Shop 3 1059.40 1107.10 1151.40 1197.40 .
Shop 4
Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .
Shop 4
Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40
However, spaCy did not split the sentences at the full stops but according to some internal rules that are opaque to me, so my code returned the following results:
Shop 1
Shop 1 942.10
Shop 2
Shop 2 first 12 months
Shop 2
Shop 2 after 12 months 1045.50 1092.60
Shop 3
Shop 3
Shop 4
Shop 4 first 3 months
Shop 4
Shop 4 after 3 months
Is there a way to instruct or override how spaCy recognizes sentences in a text based on a specific pattern (the full stop '.' in this case)?
Upvotes: 1
Views: 2246
Reputation: 4548
What you probably want to do is define a custom sentence segmenter. By default, spaCy's sentence segmentation uses the dependency parse to figure out where sentences begin and end. You can override this by writing your own function that sets sentence boundaries and adding it to the NLP pipeline. Following the example in spaCy's documentation:
import spacy
def custom_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the parser not to start a sentence here
    return doc
nlp = spacy.load('en_core_web_lg', disable=['ner'])
nlp.add_pipe(custom_sentencizer, before="parser")  # Insert before the parser can build its own sentences

# text = ...
doc = nlp(text)
matcher = spacy.matcher.Matcher(nlp.vocab)
pattern = [{'ORTH': 'Shop'}, {'LIKE_NUM': True}]
matcher.add('REV', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')
# Shop 1
# Shop 1 942.10 984.50 1023.90 1064.80 .
#
# Shop 2
# Shop 2 first 12 months 1032.70 1079.10 1122.30 1167.20 .
#
# Shop 2
# Shop 2 after 12 months 1045.50 1092.60 1136.30 1181.70 .
#
# Shop 3
# Shop 3 1059.40 1107.10 1151.40 1197.40 .
#
# Shop 4
# Shop 4 first 3 months 1072.60 1120.90 1165.70 1212.30 .
#
# Shop 4
# Shop 4 after 3 months 1082.40 1131.10 1176.40 1223.40
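As an aside, newer spaCy releases (2.2+, if I recall correctly) ship a rule-based Sentencizer whose punct_chars argument does the same job without a hand-written function. Here is a minimal sketch of that variant (nlp3, doc3, and matcher3 are just illustrative names); since span.sent only needs the boundary flags, not a full parse, the parser can be disabled outright:

import spacy
from spacy.matcher import Matcher
from spacy.pipeline import Sentencizer

# Assumption: spaCy 2.2+, where Sentencizer accepts punct_chars.
nlp3 = spacy.load('en_core_web_lg', disable=['ner', 'parser'])  # no parser needed
nlp3.add_pipe(Sentencizer(punct_chars=['.']))  # split only on '.' tokens

doc3 = nlp3(text)  # the original '. '-separated text from the question
matcher3 = Matcher(nlp3.vocab)
matcher3.add('REV', None, [{'ORTH': 'Shop'}, {'LIKE_NUM': True}])

for match_id, start, end in matcher3(doc3):
    matched_span = doc3[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')

The decimal points inside values like 942.10 don't trigger a split here, because the tokenizer keeps such numbers as single tokens and only standalone '.' tokens count.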
Your text is very different from natural language, so it's no wonder spaCy doesn't do a great job: its models are trained on examples that look like text you would read in a book or on the Internet, while your example looks more like a machine-readable list of numbers. For instance, if the text were written out as prose, it might look something like this:
Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80. Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20. Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70. Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40. Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30. After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.
Using this as the input gives spaCy's default parser a much better chance at figuring out where the sentence breaks are, even with all those other punctuation marks:
text2 = "Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80. Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20. Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70. Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40. Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30. After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40."
nlp2 = spacy.load('en_core_web_lg', disable=['ner']) # default sentencizer
doc2 = nlp2(text2)
matches2 = matcher(doc2) # same matcher
for match_id, start, end in matches2:
    matched_span = doc2[start:end]
    print(matched_span.text)
    print(matched_span.sent, '\n')
# Shop 1
# Shop 1's numbers were 942.10, 984.50, 1023.90, and 1064.80.
#
# Shop 2
# Shop 2, for the first 12 months, had numbers 1032.70, 1079.10, 1122.30, and 1167.20.
#
# Shop 2
# Shop 2, after 12 months, had 1045.50, 1092.60, 1136.30, and 1181.70.
#
# Shop 3
# Shop 3: 1059.40, 1107.10, 1151.40, and 1197.40.
#
# Shop 4
# Shop 4 in the first 3 months had numbers 1072.60, 1120.90, 1165.70, and 1212.30.
#
# Shop 4
# After 3 months, Shop 4 had 1082.40, 1131.10, 1176.40, and 1223.40.
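If you want to sanity-check where the default parser put the boundaries, independent of the matcher, you can iterate over the sentences directly:

# Print each sentence the default parser found in the prose version.
for sent in doc2.sents:
    print(sent.text)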
Note that this is not foolproof: the default parser will still mess up if the sentence structure gets too complex or fancy. NLP in general, and spaCy in particular, is not about parsing a small data set and extracting particular values exactly right every time; it's more about parsing gigabytes of documents quickly and doing a good enough job, in a statistical sense, to support meaningful computations on the data.
Upvotes: 1