Reputation: 364
I'm attempting to use spaCy to extract information out of strings like this (though not exclusively like this):

GENERAL SYSTEM INFORMATION: 194V MAX DC VOLTAGE MODULES: (19) phonosolar PS310M-24/T 10.36A MAX POWER CURRENT

The important entities here are 19 (EQUIP_QUANTITY), phonosolar (EQUIP_MANF), and PS310M-24/T (EQUIP_MODELNO). However, the results from training on a large dataset (10k samples of strings like this) are not very good; it misclassifies many entities. A warning says misaligned entities will be considered "-", yet when I look at the offsets_to_biluo_tags output there are no such entries, only O tags and the expected entity tags (B-EQUIP_QUANTITY, I-EQUIP_QUANTITY, etc.). The warning comes up when I initialize my training samples and turn them into Example objects with the following code:
from tqdm import tqdm
from spacy.tokens import Doc
from spacy.training import Example

TRAIN_DATA = []
for thing in tqdm(data):
    try:
        # Tokenize on single spaces and build a reference Doc from those tokens
        words = thing['text'].split(" ")
        doc = Doc(nlp.vocab, words=words)
        ex = Example.from_dict(doc, {"words": words, "entities": thing['entities']})
        TRAIN_DATA.append(ex)
    except Exception:
        print("failure")
where thing['text'] is a string like the one above and thing['entities'] is a list of entities, each a tuple of the form (startIndex, endIndex, type) with endIndex exclusive.
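For reference, a single record in this format might look like the sketch below; the character offsets are hand-computed for the sample string above and are illustrative only, and the check mirrors what I'm doing with offsets_to_biluo_tags:

import spacy
from spacy.tokens import Doc
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

# One record in the format described above; offsets are hand-computed
# for this sample string and illustrative only.
thing = {
    "text": "GENERAL SYSTEM INFORMATION: 194V MAX DC VOLTAGE MODULES: (19) phonosolar PS310M-24/T 10.36A MAX POWER CURRENT",
    "entities": [
        (58, 60, "EQUIP_QUANTITY"),   # "19"
        (62, 72, "EQUIP_MANF"),       # "phonosolar"
        (73, 84, "EQUIP_MODELNO"),    # "PS310M-24/T"
    ],
}

# Rebuild the whitespace-tokenized Doc used in the loop above and
# inspect the BILUO tags; any "-" marks a span that does not land on
# a token boundary of this tokenization.
doc = Doc(nlp.vocab, words=thing["text"].split(" "))
print(offsets_to_biluo_tags(doc, thing["entities"]))

In this hand-made record, "(19)" is a single whitespace token, so a span covering only 19 cannot be expressed under that tokenization and would be tagged "-"; whether that happens for a given record depends on the offsets in the real data.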
Edit: the reason I'm not doing a rule-based approach and am instead doing classification/entity recognition is that a rule-based approach is not maintainable in this case: there are very many manufacturers in many formats and very many model numbers in many formats, and I would like spaCy to pick up on these nuances.
Upvotes: 1
Views: 1083
Reputation: 11484
It's hard to know without a complete example, but if your entity spans are all based on character offsets, I would create examples like this instead, so you're not introducing an unnecessary alternative space-based tokenization in the reference doc:

ex = Example.from_dict(nlp.make_doc(thing['text']), {"entities": thing['entities']})

Here, it will use the default tokenization from nlp instead of space-separated tokens, which may help with the alignment errors. If your character offsets don't align with the nlp.make_doc token boundaries, you can still end up with misalignments, but then you only have one tokenization to worry about instead of two.
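Applied to the loop from the question, that change would look roughly like this (a sketch; data and thing are the structures from the question):

from tqdm import tqdm
from spacy.training import Example

TRAIN_DATA = []
for thing in tqdm(data):
    # The reference Doc comes from nlp's own tokenizer, and only the
    # character offsets are passed in, so there is a single tokenization
    # for spaCy to align the spans against.
    doc = nlp.make_doc(thing['text'])
    ex = Example.from_dict(doc, {"entities": thing['entities']})
    TRAIN_DATA.append(ex)

Example.from_dict will still warn about any span whose character offsets don't map onto those token boundaries, so remaining warnings point at offsets that genuinely disagree with the tokenization.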
Upvotes: 1