Reputation: 21
This is my first time asking a question so please let me know if there's more information that you might need.
I have a spaCy doc and a list of tags that looks like ['O', 'O', 'PERSON', 'O', 'GPE', ...]
and would like to edit the entity labels in the doc object to match the tags.
I understand that the doc consists of tokens and also entities (Doc.ents), but there seem to be a lot of different components where if I change one artificially it might break the integrity of the Doc.
My question is, what would be the best way to go about this? Is there a certain constructor I can use or a method? Can I just change the label directly?
Thanks!
Upvotes: 2
Views: 492
Reputation: 15593
Normally the easiest way to set entities from IOB tags, besides using a file converter, is to use the Doc constructor. But it looks like your tags don't have the IOB part (they should look like B-PERSON, not just PERSON), so that won't work as-is.
You can set the entity label on a token, but you can't set the IOB tag directly, and without matching IOB tags the entities won't be recognized correctly.
Since it sounds like these tags are coming from outside spaCy, tokenization is also an issue. Are you sure the tokens will always align with spaCy tokens?
If alignment is not an issue, and you never have two separate entities with the same label occur back to back with nothing in between, I guess you could automatically convert your labels to IOB just by making the first item of each run B and every following item in that run I. If you do that then you can use the Doc constructor.
Example:
original: O PERSON PERSON O O
cleaned: O B-PERSON I-PERSON O O
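The conversion above can be sketched in a few lines of plain Python. This is only a minimal sketch under the assumptions already stated (tags align one-to-one with tokens, and adjacent tokens with the same label belong to the same entity); the function name `to_iob` and the `outside` parameter are made up for illustration and are not part of spaCy.

```python
def to_iob(tags, outside="O"):
    """Turn bare labels (PERSON) into IOB tags (B-PERSON / I-PERSON).

    Assumption: consecutive identical labels form ONE entity, so two
    distinct same-label entities that touch would be merged.
    """
    iob = []
    previous = outside
    for tag in tags:
        if tag == outside:
            iob.append(outside)        # not part of any entity
        elif tag == previous:
            iob.append("I-" + tag)     # continues the current run
        else:
            iob.append("B-" + tag)     # starts a new run
        previous = tag
    return iob


tags = ["O", "PERSON", "PERSON", "O", "O"]
iob_tags = to_iob(tags)
print(iob_tags)  # ['O', 'B-PERSON', 'I-PERSON', 'O', 'O']
```

In spaCy v3 the resulting list should be usable as the ents argument of the Doc constructor, for example Doc(nlp.vocab, words=words, ents=iob_tags), provided words has exactly one entry per tag. Note that this builds a new Doc rather than editing the existing one.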
Upvotes: 2