noobie

Reputation: 2607

How to match number and text in same token - Spacy Matcher?

I have the following sentence and want to extract '12am' from it.

He is working at 12am

I am using the spaCy Matcher (language model en_core_web_lg), and it breaks the text into the following tokens:

[He] [is] [working] [at] [12am]

And the patterns I tried are:

[{ "LIKE_NUM": true }, {"IS_SPACE": false}, { "LOWER": "am" }],
[{ "LIKE_NUM": true , "LOWER": "am" }],
[{ "SHAPE": 'dd' , "ORTH": "am" }]

None of these work so far, presumably because the number and the text are in a single token, [12am].

I need help creating a pattern that matches this token.

Any advice appreciated. Thanks

Upvotes: 2

Views: 3583

Answers (1)

Tiago Duque

Reputation: 2069

You don't need spaCy for that; a simple regex is enough. If you do want to use spaCy, I'll show below how to use the regex functionality of the spaCy Matcher.

Using Regex

Pattern: [0-9]+[,.]?[0-9]+[ ]?[A-Za-z]+

Explanation: the pattern looks for one or more digits ([0-9]+), then an optional comma or dot ([,.]?), then one or more further digits ([0-9]+). After that comes an optional space ([ ]?) followed by one or more uppercase or lowercase letters ([A-Za-z]+). Because both digit groups use +, the pattern needs at least two digits, so it matches 12am but not 5am.

You can drop the [ ]? part if the number and the text are never separated by a space in your data.

Here's a live example: https://regex101.com/r/HmTKD7/1

In python:

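import re
text = "He is working at 12am"
# Digits, optional separator, more digits, optional space, then letters
pattern = r'[0-9]+[,.]?[0-9]+[ ]?[A-Za-z]+'
results = re.findall(pattern, text)
print(results)  # ['12am']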
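
The pattern above requires at least two digits and allows a space before the letters. As a rough sketch of one possible variation (not part of the original answer), the version below makes the separator and the second digit group optional together and drops the space, so single-digit values also match. The names NUMBER_UNIT and find_number_units are made up for this illustration.

```python
import re

# Hypothetical variant of the pattern above: the "[,.]digits" part is
# optional as a whole, and no whitespace is allowed before the letters.
NUMBER_UNIT = re.compile(r"[0-9]+(?:[,.][0-9]+)?[A-Za-z]+")


def find_number_units(text):
    """Return every run of digits immediately followed by letters."""
    return NUMBER_UNIT.findall(text)


print(find_number_units("He is working at 12am"))
print(find_number_units("Meet at 5am, run 2.5km"))
```

Whether single-digit values or a space before the letters should count as a match depends on your data, so treat this as a starting point rather than a drop-in replacement.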

Using spaCy matcher:

In spaCy, you could use the following matcher pattern:

pattern = [{"TEXT": {"REGEX": "[0-9]+[,.]?[0-9]+[A-Za-z]+"}}]

Just remember that if there's whitespace between the number and the measure type, spaCy will split them into two tokens. That's why the regex in this pattern does not include whitespace.

Currently there's no way to present a live demo using REGEX in https://explosion.ai/demos/matcher, but REGEX has been supported in the spaCy Matcher since v2.1.
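
For reference, here is a minimal sketch of how the pattern might be wired into a Matcher. It assumes spaCy v3, where matcher.add takes a list of patterns (v2 used a different call signature). It also builds the Doc by hand from the tokens listed in the question instead of loading en_core_web_lg, since tokenization of strings like 12am may differ between spaCy versions and models. The rule name NUMBER_UNIT is arbitrary.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Build a Doc directly from the tokens shown in the question
# (assumption: the tokenizer really does keep "12am" as one token).
vocab = Vocab()
doc = Doc(vocab, words=["He", "is", "working", "at", "12am"])

# Single-token pattern: the token text must satisfy the regex.
pattern = [{"TEXT": {"REGEX": "[0-9]+[,.]?[0-9]+[A-Za-z]+"}}]

matcher = Matcher(vocab)
matcher.add("NUMBER_UNIT", [pattern])  # spaCy v3-style call

# Each match is a (match_id, start, end) triple of token indices.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

With a loaded pipeline you would call nlp(text) instead of building the Doc manually; it is worth printing the tokens first to confirm how your model splits the text.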

Upvotes: 4
