Nir Elbaz

Reputation: 616

Customize Tokenizer in spaCy

I am using spaCy v2.

I am looking for dates in a doc, and I want the tokenizer to merge them.

For example:

doc = 'Customer: Johnna 26 06 1989'

The default tokenizer output looks like:

('Customer:', 'customer:', 'NUM', 'CD', 'amod', 'Xxxxx:', False, False)
('Johnna', 'Johnna ', 'PROPN', 'NNP', 'ROOT', 'xxxx', True, False)
('26', '26', 'NUM', 'CD', 'compound', 'dd', False, False)
('06', '06', 'NUM', 'CD', 'appos', 'dd', False, False)
('1989', '1989', 'NUM', 'CD', 'nummod', 'dddd', False, False)

While I want it to look like:

('Customer:', 'customer:', 'NUM', 'CD', 'amod', 'Xxxxx:', False, False)
('Johnna', 'Johnna ', 'PROPN', 'NNP', 'ROOT', 'xxxx', True, False)
('26 06 1989', '26', 'NUM', 'CD', 'compound', 'dd dd dd', False, False)

I tried to create a custom tokenizer, but I am not sure whether I need to change the prefix or the suffix search, or how to define this case:

import re
from spacy.tokenizer import Tokenizer

def __customize_tokenizer(self):
    prefix_re = re.compile(r'\d+\s+\d+')
    return Tokenizer(self._nlp.vocab, prefix_search=prefix_re.search)

Thanks,

Nir

Upvotes: 0

Views: 1415

Answers (2)

Wiktor Stribiżew

Reputation: 626758

You can just use nlp.add_pipe("merge_entities"). From the docs:

Merge named entities into a single token. Also available via the string name "merge_entities".

See this Python snippet:

nlp.add_pipe("merge_entities")  # must be added before the text is processed
doc = nlp('Customer: Johnna 26 06 1989')
print([(t.text, t.pos_, t.lemma_) for t in doc])
# => [
#      ('Customer', 'NOUN', 'customer'), 
#      (':', 'PUNCT', ':'), 
#      ('Johnna', 'PROPN', 'Johnna'),
#      ('26 06 1989', 'NUM', '26 06 1989')
#    ]
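
Note that the question mentions spaCy v2, where nlp.add_pipe takes the component callable rather than a string name. A minimal sketch of the v2 equivalent, assuming a pretrained model such as en_core_web_sm whose NER actually tags the date as an entity:

import spacy
from spacy.pipeline import merge_entities  # pipeline function in spaCy v2

nlp = spacy.load("en_core_web_sm")  # assumed model; any v2 pipeline with NER
nlp.add_pipe(merge_entities)        # v2: pass the callable, not a string name

doc = nlp('Customer: Johnna 26 06 1989')
print([(t.text, t.pos_, t.lemma_) for t in doc])

Either way, the merge only happens if the NER component recognizes "26 06 1989" as an entity, so results depend on the model.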

Upvotes: 1

aab

Reputation: 11474

The tokenizer algorithm doesn't support this kind of pattern: it doesn't support regexes in its exceptions and the affix patterns aren't applied across whitespace.

Instead, one option is to find these cases with the Matcher, which does support regexes, and use the retokenizer to merge the tokens:

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("DATE", [[{"ORTH": {"REGEX": r"\d\d"}},
                      {"ORTH": {"REGEX": r"\d\d"}},
                      {"ORTH": {"REGEX": r"\d\d\d\d"}}]])

text = "This is a date 01 02 2000 in a sentence."

doc = nlp(text)

with doc.retokenize() as retokenizer:
    for match_id, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])
# ['This', 'is', 'a', 'date', '01 02 2000', 'in', 'a', 'sentence', '.']

If you want, you can put the matching and retokenization into a custom component at the beginning of your pipeline, as sketched below; see https://v2.spacy.io/usage/processing-pipelines#custom-components
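
For example, here is a minimal sketch of such a component in the spaCy v2 style, where a component is just a callable that takes and returns a Doc (the name merge_dates and the anchored regexes are illustrative, not part of the original answer):

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Anchored regexes: each token must be exactly dd / dd / dddd
matcher.add("DATE", [[{"ORTH": {"REGEX": r"^\d\d$"}},
                      {"ORTH": {"REGEX": r"^\d\d$"}},
                      {"ORTH": {"REGEX": r"^\d{4}$"}}]])

def merge_dates(doc):
    # Merge every matched date span into a single token
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matcher(doc):
            retokenizer.merge(doc[start:end])
    return doc

# Run it first so all later components see the merged tokens
nlp.add_pipe(merge_dates, first=True)

doc = nlp("This is a date 01 02 2000 in a sentence.")
print([t.text for t in doc])
# ['This', 'is', 'a', 'date', '01 02 2000', 'in', 'a', 'sentence', '.']

In real text, matches can overlap; filtering them first (e.g. with spacy.util.filter_spans) before merging avoids retokenizer errors.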

Upvotes: 4
