user12541823


Warning: [W030] Some entities could not be aligned in the text

TRAIN_DATA = [
    ("XYZxyzg hat die beste Camera für Selfies", {"entities": [(0, 7, "BRAND"), (23, 28, "CAMERA")]}),
]

When I train on this data, I keep getting the following warning:

UserWarning: [W030] Some entities could not be aligned in the text "XYZxyzg hat die beste Camera für Selfie" with entities "[(0, 7, 'BRAND'), (23, 28, 'CAMERA')]". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  gold = GoldParse(doc, **gold)

What's wrong with my character offsets? Should I exclude whitespace when counting? I tried that too, but it didn't help. And how do I use spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities) to check the offsets, as the warning suggests?

Upvotes: 3

Views: 5546

Answers (2)

Dan Taninecz Miller

Reputation: 121

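I believe spacy.gold.biluo_tags_from_offsets is no longer available under that name in spaCy v3: the spacy.gold module became spacy.training, and the function was renamed to offsets_to_biluo_tags.

You can replace "from spacy.gold import biluo_tags_from_offsets" with "from spacy.training import offsets_to_biluo_tags".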

https://spacy.io/api/top-level#offsets_to_biluo_tags
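To show how the check suggested by the warning might look with the renamed function, here is a minimal sketch. It assumes spaCy v3 is installed and uses a blank German pipeline (spacy.blank("de")), which is my assumption since the question does not say which pipeline was used; only the tokenizer matters for alignment. The second entity list, with the start moved to 22, is included for comparison.

```python
import spacy
from spacy.training import offsets_to_biluo_tags  # spaCy v3 name

# Assumption: a blank German pipeline; only its tokenizer is used here.
nlp = spacy.blank("de")

text = "XYZxyzg hat die beste Camera für Selfies"
doc = nlp.make_doc(text)

original = [(0, 7, "BRAND"), (23, 28, "CAMERA")]   # offsets from the question
corrected = [(0, 7, "BRAND"), (22, 28, "CAMERA")]  # second start moved to 22

# The misaligned list also triggers the W030 warning when converted.
tags_original = offsets_to_biluo_tags(doc, original)
tags_corrected = offsets_to_biluo_tags(doc, corrected)

# Print each token with its start offset and both tags.
# A "-" marks a token that an entity overlaps without matching its boundaries.
for token, before, after in zip(doc, tags_original, tags_corrected):
    print(f"{token.idx:>2}  {token.text:<8} {before:<10} {after}")
```

Any token tagged "-" in the first column of tags is one whose entity offsets need fixing; token.idx gives the character position where that token actually starts.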

Upvotes: 3

Sofie VL

Reputation: 3106

From your post:

TRAIN_DATA = [
    ("XYZxyzg hat die beste Camera für Selfies", {"entities": [(0, 7, "BRAND"), (23, 28, "CAMERA")]}),
]

The entity offsets need to align with token boundaries. You can't start or end an entity in the middle of a token. In your case, it looks like a small error crept in: "Camera" starts at character 22, so (23, 28) covers only "amera". I think the offsets of the second entity should be (22, 28, "CAMERA") instead.
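One way to avoid counting characters by hand is to derive the offsets from the entity strings themselves. The sketch below uses plain Python; find_span is a hypothetical helper written for this illustration, not part of spaCy, and it only handles the first occurrence of each phrase.

```python
text = "XYZxyzg hat die beste Camera für Selfies"


def find_span(text, phrase, label):
    """Hypothetical helper: (start, end, label) for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    # The end offset is exclusive, matching Python slicing and spaCy's convention.
    return (start, start + len(phrase), label)


entities = [
    find_span(text, "XYZxyzg", "BRAND"),
    find_span(text, "Camera", "CAMERA"),
]

# Sanity check: slicing the text with each span must give back the phrase.
for start, end, label in entities:
    print(start, end, label, repr(text[start:end]))

TRAIN_DATA = [(text, {"entities": entities})]
```

Offsets built this way start and end on the phrase itself, so they line up with token boundaries as long as the tokenizer does not split or merge the phrase differently.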

Upvotes: 2
