Reputation: 364
I'm attempting to use spaCy to extract information out of strings like this (though not exclusively like this):

GENERAL SYSTEM INFORMATION: 194V MAX DC VOLTAGE MODULES: (19) phonosolar PS310M-24/T 10.36A MAX POWER CURRENT

The important entities here are 19 (EQUIP_QUANTITY), phonosolar (EQUIP_MANF), and PS310M-24/T (EQUIP_MODELNO). However, the results from training on a large dataset (10k samples of strings like this) are not very good; it misclassifies many entities. A warning says misaligned entities will be considered "-", yet when I look at the offsets_to_biluo_tags output there are no such entries, only O tags and the expected entity tags (B-EQUIP_QUANTITY, I-EQUIP_QUANTITY, etc.). The warning comes up when I initialize my training samples and turn them into Example objects with the following code:
from tqdm import tqdm
from spacy.tokens import Doc
from spacy.training import Example

TRAIN_DATA = []
for thing in tqdm(data):
    try:
        # Tokenize on single spaces and build a reference Doc from those tokens
        words = thing['text'].split(" ")
        doc = Doc(nlp.vocab, words=words)
        ex = Example.from_dict(doc, {"words": words, "entities": thing['entities']})
        TRAIN_DATA.append(ex)
    except Exception:
        print("failure")
where thing['text'] is a string like the one above and thing['entities'] is a list of entities, each a tuple of the form (startIndex, endIndex, type) with endIndex exclusive.
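For reference, a single record in this format might look like the sketch below; the character offsets are hand-computed for the sample string above and are illustrative only, and the check mirrors what I'm doing with offsets_to_biluo_tags:

import spacy
from spacy.tokens import Doc
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")

# One record in the format described above; offsets are hand-computed
# for this sample string and illustrative only.
thing = {
    "text": "GENERAL SYSTEM INFORMATION: 194V MAX DC VOLTAGE MODULES: (19) phonosolar PS310M-24/T 10.36A MAX POWER CURRENT",
    "entities": [
        (58, 60, "EQUIP_QUANTITY"),   # "19"
        (62, 72, "EQUIP_MANF"),       # "phonosolar"
        (73, 84, "EQUIP_MODELNO"),    # "PS310M-24/T"
    ],
}

# Rebuild the whitespace-tokenized Doc used in the loop above and
# inspect the BILUO tags; any "-" marks a span that does not land on
# a token boundary of this tokenization.
doc = Doc(nlp.vocab, words=thing["text"].split(" "))
print(offsets_to_biluo_tags(doc, thing["entities"]))

In this hand-made record, "(19)" is a single whitespace token, so a span covering only 19 cannot be expressed under that tokenization and would be tagged "-"; whether that happens for a given record depends on the offsets in the real data.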
Edit: the reason I'm not doing a rule-based approach and am instead doing classification/entity recognition is that a rule-based approach is not maintainable in this case: there are very many manufacturers in many formats and very many model numbers in many formats, and I would like spaCy to pick up on these nuances.
Upvotes: 1
Views: 1083
Reputation: 11484
It's hard to know without a complete example, but if your entity spans are all based on character offsets, I would create examples like this instead, so you're not introducing an unnecessary alternative space-based tokenization in the reference doc:

ex = Example.from_dict(nlp.make_doc(thing['text']), {"entities": thing['entities']})

Here, it will use the default tokenization from nlp instead of space-separated tokens, which may help with the alignment errors. If your character offsets don't align with the nlp.make_doc token boundaries, you can still end up with misalignments, but then you only have one tokenization to worry about instead of two.
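Applied to the loop from the question, that change would look roughly like this (a sketch; data and thing are the structures from the question):

from tqdm import tqdm
from spacy.training import Example

TRAIN_DATA = []
for thing in tqdm(data):
    # The reference Doc comes from nlp's own tokenizer, and only the
    # character offsets are passed in, so there is a single tokenization
    # for spaCy to align the spans against.
    doc = nlp.make_doc(thing['text'])
    ex = Example.from_dict(doc, {"entities": thing['entities']})
    TRAIN_DATA.append(ex)

Example.from_dict will still warn about any span whose character offsets don't map onto those token boundaries, so remaining warnings point at offsets that genuinely disagree with the tokenization.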
Upvotes: 1