Python spaCy custom NER – how to prepare multi-word entities?

Question

:) Please help :)

I'm preparing custom Named Entity Recognition using a blank spaCy model. I use only one entity: Brand (we can call it 'ORG', as in Organisation). I have short texts with ORGs and have prepared the data like this (but I can change it):

train_data = [
    ('First text in string with Name I want', {'entities': [(START, END, 'ORG')]}),
    ('Second text with Name and Name2', {'entities': [(START, END, 'ORG'), (START2, END2, 'ORG')]})
]

START and END are the start and end character indexes of the brand name in the text, of course.

This is working well, but...

The problem I have is how to prepare entities for brands that are made of two (or more) words. Let's say Brand Name is the full name of a company. How do I prepare the entity?

Consider the tuple itself for a single text:

text = 'Third text with Brand Name'

company = 'Brand Name'

Can I treat company as one word?

('Third text with Brand Name', {'entities': [(16, 26, 'ORG')]})

Or as two separate brands, 'Brand' and 'Name'? (This would not be useful in my case when I use the model later. :( )

('Third text with Brand Name', {'entities': [(16, 21, 'ORG'), (22, 26, 'ORG')]})

Or should I use a different labeling format, e.g. BIO, so that Brand is B-ORG and Name is I-ORG?
- If so, can I prepare it like this for spaCy:

('Third text with Brand Name', {'entities': [(16, 21, 'B-ORG'), (22, 26, 'I-ORG')]})

Or should I change the format of train_data because I also need the 'O' from BIO?
How? Like this?

('Third text with Brand Name', {'entities': ['O', 'O', 'O', 'B-ORG', 'I-ORG']})

The question is about the format of train_data for 'Third text with Brand Name': how to label the entity. Once I have the format, I will handle the code. :)

The same question applies to entities of three or more words. :)

polm23 · Accepted Answer

You can just provide the start and end offsets for the whole entity. You describe this as "treating it as one word", but the character offsets don't have any direct relation to tokenization - they won't affect tokenizer output.

You will get an error if the start and end of your entity don't match token boundaries, but it doesn't matter if the entity is one token or many.
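
As a rough illustration of the point above, here is a minimal sketch that builds the training tuple for a multi-word brand by searching for the whole name and recording one (start, end, label) span. The helper name `entity_offsets` is hypothetical, and the `(?<!\w)` / `(?!\w)` guards are only an approximation of token boundaries, not spaCy's actual tokenizer. Overlapping names are not handled.

```python
import re

def entity_offsets(text, names, label="ORG"):
    """Return sorted (start, end, label) spans for whole-word matches of each name.

    Hypothetical helper: the lookarounds approximate token boundaries,
    they do not reproduce spaCy's tokenizer.
    """
    spans = []
    for name in names:
        # Match the full name only when it is not glued to other word characters.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

text = "Third text with Brand Name"
# One span covers the whole two-word entity.
example = (text, {"entities": entity_offsets(text, ["Brand Name"])})
print(example)
```

With real data it is probably worth verifying the spans against spaCy itself, for instance with `doc.char_span(start, end)`, which returns `None` when the offsets do not line up with token boundaries.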

I recommend you take a look at the training data section in the spaCy docs. Your specific question isn't answered explicitly, but that's only because multi-token entities don't require special treatment. The examples there include multi-token entities.

Regarding BIO tagging, for details on how to use it with spaCy you can see the docs for spacy convert.
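
To show how the two representations relate, here is a small sketch that derives per-token BIO tags from a character-offset span. It assumes plain whitespace tokenization, which is a simplification (spaCy's tokenizer also splits punctuation), and the function name `offsets_to_bio` is hypothetical. It does not produce the file layout that `spacy convert` expects.

```python
import re

def offsets_to_bio(text, entities):
    """Map (start, end, label) character spans to BIO tags over whitespace tokens.

    Hypothetical helper; whitespace splitting is an assumption made for
    illustration only.
    """
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for ent_start, ent_end, label in entities:
            # A token belongs to an entity when it lies fully inside the span.
            if tok_start >= ent_start and tok_end <= ent_end:
                prefix = "B-" if tok_start == ent_start else "I-"
                tag = prefix + label
                break
        tags.append(tag)
    return [tok for tok, _, _ in tokens], tags

tokens, tags = offsets_to_bio("Third text with Brand Name", [(16, 26, "ORG")])
print(list(zip(tokens, tags)))
```

The single span (16, 26, 'ORG') already carries the same information as the B-ORG / I-ORG sequence, which is why the offset format needs no special handling for multi-word names.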
