Reputation: 2607
I have the following sentence and want to extract '12am' from it:
He is working at 12am
I am using the spaCy Matcher (language model en_core_web_lg), which splits the text into the following tokens:
[He] [is] [working] [at] [12am]
And the patterns I tried are:
[{ "LIKE_NUM": true }, {"IS_SPACE": false}, { "LOWER": "am" }],
[{ "LIKE_NUM": true , "LOWER": "am" }],
[{ "SHAPE": 'dd' , "ORTH": "am" }]
Nothing has worked so far, presumably because [12am] is a single token rather than a number token followed by an "am" token.
I need help creating a pattern for matching:
Any advice appreciated. Thanks
Upvotes: 2
Views: 3583
Reputation: 2069
You don't need spaCy for that; a simple regex is enough. If you do want to use spaCy, I'll show how to use the Matcher's regex functionality below.
Using Regex
Pattern: [0-9]+[,.]?[0-9]+[ ]?[A-Za-z]+
Explanation: the pattern first looks for one or more digits ([0-9]+), then an optional dot or comma ([,.]?), then one or more further digits ([0-9]+). After that comes an optional space ([ ]?) followed by one or more upper- or lowercase letters ([A-Za-z]+). Because the second [0-9]+ is not optional, the number must have at least two digits: 12am matches, 9am does not.
If you don't want to allow a space between the number and the letters, drop the [ ]? part.
Here's a live example: https://regex101.com/r/HmTKD7/1
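In Python:
import re
text = "He is working at 12am"
# digits, optional separator, digits, optional space, letters
pattern = r'[0-9]+[,.]?[0-9]+[ ]?[A-Za-z]+'
results = re.findall(pattern, text)  # ['12am']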
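To see how the pattern behaves on a few inputs, here is a small sketch using only the standard library. ORIGINAL is the pattern from this answer; RELAXED is a hypothetical variant (not part of the original answer) that groups the separator with the digits after it and makes that group optional, so single-digit numbers also match. The sample sentences other than the first are made up for illustration.

```python
import re

# Pattern from the answer: the second [0-9]+ is mandatory,
# so the number needs at least two digits.
ORIGINAL = re.compile(r"[0-9]+[,.]?[0-9]+[ ]?[A-Za-z]+")

# Hypothetical variant (an assumption, not from the answer): the separator
# and the digits after it form one optional group.
RELAXED = re.compile(r"[0-9]+(?:[,.][0-9]+)?[ ]?[A-Za-z]+")

samples = [
    "He is working at 12am",
    "He is working at 9am",
    "It weighs 1.5kg",
]

for sample in samples:
    # Print the matches of both patterns side by side.
    print(sample, ORIGINAL.findall(sample), RELAXED.findall(sample))
```

Both patterns also match any number followed by a word (for example "12 apples"), so if only times are wanted, replacing [A-Za-z]+ with something narrower such as [AaPp][Mm] may be a better fit.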
Using spaCy matcher:
In spaCy you could use the following Matcher pattern:
pattern = [{"TEXT": {"REGEX": "[0-9]+[,.]?[0-9]+[A-Za-z]+"}}]
Just remember that if there is whitespace between the number and the unit, spaCy will split them into two tokens. That's why the regex in this pattern does not include whitespace.
Currently there's no way to present a live demo using REGEX in https://explosion.ai/demos/matcher, but REGEX has been supported by the spaCy Matcher since v2.1.
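In the absence of a live demo, the pattern can be tried locally. The sketch below is a minimal example that assumes spaCy v3, where Matcher.add takes a list of patterns (in v2 the call was matcher.add(key, None, pattern)). To keep it independent of any model and of how a given version tokenizes the sentence, it builds the Doc by hand from the tokens reported in the question. The rule name "NUMBER_WITH_UNIT" is an arbitrary label chosen for this example.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build the Doc from the tokens listed in the question, so no model is needed.
vocab = Vocab()
doc = Doc(vocab, words=["He", "is", "working", "at", "12am"])

# One-token pattern: the token text must match the regex from the answer.
pattern = [{"TEXT": {"REGEX": "[0-9]+[,.]?[0-9]+[A-Za-z]+"}}]

matcher = Matcher(vocab)
matcher.add("NUMBER_WITH_UNIT", [pattern])  # spaCy v3 signature

# Each match is a (match_id, start, end) triple of token indices.
matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(matches)
```

With a real pipeline you would replace the hand-built Doc with nlp(text); whether "12am" then arrives as one token depends on the tokenizer, so it is worth printing [t.text for t in doc] first.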
Upvotes: 4