Intoxicated Penguin

Reputation: 364

spaCy misaligned entities

I'm attempting to use spaCy to extract information from strings like this: GENERAL SYSTEM INFORMATION: 194V MAX DC VOLTAGE MODULES: (19) phonosolar PS310M-24/T 10.36A MAX POWER CURRENT (though not exclusively in this format). The important entities here are 19: EQUIP_QUANTITY, phonosolar: EQUIP_MANF, and PS310M-24/T: EQUIP_MODELNO.

However, the results from training on a large dataset (10k samples of strings like this) are not very good; the model misclassifies many entities. The warning says misaligned entities will be considered "-", but when I look at the offsets_to_biluo_tags output there are no such entries, only O tags and the expected entity tags (B-EQUIP_QUANTITY, I-EQUIP_QUANTITY, etc.). The warning comes up when I initialize my training samples and turn them into Example objects with the following code:

from tqdm import tqdm
from spacy.tokens import Doc
from spacy.training import Example

TRAIN_DATA = []
for thing in tqdm(data):
    try:
        # Tokenize by splitting on single spaces; spaces=None means
        # every token is assumed to be followed by a space
        words = thing['text'].split(" ")
        doc = Doc(nlp.vocab, words=words, spaces=None)
        ex = Example.from_dict(doc, {"words": words, "entities": thing['entities']})
        TRAIN_DATA.append(ex)
    except Exception as e:
        print("failure:", e)

where thing['text'] is a string like the one above and thing['entities'] is a list of entities, each a tuple of the form (startIndex, endIndex, type), with endIndex exclusive.
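For reference, a single record in this format might look like the following (the character offsets are illustrative, computed against the example string above):

thing = {
    "text": ("GENERAL SYSTEM INFORMATION: 194V MAX DC VOLTAGE MODULES: "
             "(19) phonosolar PS310M-24/T 10.36A MAX POWER CURRENT"),
    "entities": [
        (58, 60, "EQUIP_QUANTITY"),  # "19" -- sits inside the token "(19)"
        (62, 72, "EQUIP_MANF"),      # "phonosolar"
        (73, 84, "EQUIP_MODELNO"),   # "PS310M-24/T"
    ],
}

Note that the EQUIP_QUANTITY span covers only 19, while splitting on spaces produces the token (19), so the span starts and ends inside a token, which I suspect is the kind of mismatch the warning is about.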

Edit: the reason I'm using classification/entity recognition rather than a rule-based approach is that a rule-based approach is not maintainable in this case: there are very many manufacturers in many formats and very many model numbers in many formats, and I would like spaCy to pick up on these nuances.

Upvotes: 1

Views: 1083

Answers (1)

aab

Reputation: 11484

It's hard to know without a complete example, but if your entity spans are all based on character offsets, I would create examples like this instead so you're not introducing an unnecessary alternative space-based tokenization in the reference doc:

        ex = Example.from_dict(nlp.make_doc(thing['text']), {"entities": thing['entities']})

Here, it will use the default tokenization from nlp instead of space-separated tokens, which may help with the alignment errors. If your character offsets don't align with the nlp.make_doc token boundaries, you can still end up with misalignments, but then you only have one tokenization to worry about instead of two.
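If you want to see exactly which spans don't line up before training, you can check the alignment directly. A minimal sketch, assuming spaCy v3 (where the helper lives in spacy.training) and a blank English pipeline standing in for your own nlp:

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")  # stand-in; use your own pipeline here

for thing in data:
    doc = nlp.make_doc(thing["text"])
    # Misaligned spans come back as "-" in the BILUO tags
    tags = offsets_to_biluo_tags(doc, thing["entities"])
    if "-" in tags:
        print("misaligned:", thing["text"])
        print(list(zip([t.text for t in doc], tags)))

Any record printed here has character offsets that cut through a token boundary under the nlp tokenizer, and those are the examples whose offsets (or tokenization) you'd need to adjust.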

Upvotes: 1
