Reputation: 12847
This is the classic training format.
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
I used to train from a script, but as I understand it, training works better with the CLI train command. However, my data is in the format above.
I have found code snippets for this type of conversion, but every one of them calls spacy.load('en') rather than starting from a blank model, which made me wonder: are they training an existing model rather than a blank one?
This chunk seems pretty easy:
import spacy
from spacy.gold import docs_to_json
import srsly
nlp = spacy.load('en', disable=["ner"])  # as you see it's loading 'en' which I don't have
TRAIN_DATA = ...  # data from above
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])
Running this code throws: Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
I am quite confused about how to use it with spacy train on a blank model. Should I just use spacy.blank('en')? But then what about the disable=["ner"] flag?
Edit:
If I try spacy.blank('en') instead, I receive: Can't import language goal from spacy.lang: No module named 'spacy.lang.en'
Edit 2:
I have tried loading en_core_web_sm:
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)
srsly.write_json("ent_train_data.json", [docs_to_json(docs)])
TypeError: object of type 'NoneType' has no len()
Ailton - print(text[start:end])
Goal! FK Qarabag 1, Partizani Tirana 0. Filip Ozobic - FK Qarabag - shot with the head from the centre of the box to the centre of the goal. Assist - Ailton - print(text)
None - doc.ents =... line
TypeError: object of type 'NoneType' has no len()
Edit 3: From Ines' comment
from spacy.gold import biluo_tags_from_offsets  # spaCy v2 import path
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    docs.append(doc)
srsly.write_json(train_name + "_spacy_format.json", [docs_to_json(docs)])
This created the JSON file, but I don't see any of my tagged entities in it.
Upvotes: 4
Views: 5272
Reputation: 51
import spacy
import srsly
from spacy.training import docs_to_json, offsets_to_biluo_tags, biluo_tags_to_spans
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
nlp = spacy.load('en_core_web_lg')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    # Convert character offsets to per-token BILUO tags, then to Span objects
    tags = offsets_to_biluo_tags(doc, annot['entities'])
    entities = biluo_tags_to_spans(doc, tags)
    # Attach the gold entities to the doc so they end up in the JSON
    doc.ents = entities
    docs.append(doc)
srsly.write_json("spacy_format.json", [docs_to_json(docs)])
As of spaCy v3.1, the above code works. Some relevant methods from spacy.gold have been renamed and moved to spacy.training.
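One caveat for v3 users: the v3 train CLI reads binary .spacy files (a serialized DocBin) rather than this JSON, so the JSON would still need to go through spacy convert. Below is a minimal sketch that writes the binary format directly from a blank pipeline, assuming spaCy v3 is installed. The output name train.spacy is only an example, and silently skipping spans that do not align with token boundaries is my assumption about how you might want to handle them.

```python
import spacy
from spacy.tokens import DocBin

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

# A blank pipeline only tokenizes, so no pretrained model is required
nlp = spacy.blank("en")
db = DocBin()
for text, annot in TRAIN_DATA:
    doc = nlp.make_doc(text)
    spans = [
        doc.char_span(start, end, label=label)
        for start, end, label in annot["entities"]
    ]
    # char_span returns None for offsets that do not match token boundaries;
    # dropping those here is an assumption, you may prefer to log or fix them
    doc.ents = [span for span in spans if span is not None]
    db.add(doc)

# Example output path (hypothetical name)
db.to_disk("train.spacy")
```

Since the pipeline is blank there is no NER component to disable, so the disable=["ner"] flag from the question is not needed here.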
Upvotes: 5
Reputation: 11474
Edit 3 is close, but you're missing a step where you add the entities to the document. This should work:
import spacy
import srsly
from spacy.gold import docs_to_json, biluo_tags_from_offsets, spans_from_biluo_tags
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    entities = spans_from_biluo_tags(doc, tags)
    # This is the step missing from Edit 3: attach the entities to the doc
    doc.ents = entities
    docs.append(doc)
srsly.write_json("spacy_format.json", [docs_to_json(docs)])
It would be good to add a built-in function to do this conversion, since it's common to want to shift from the example scripts (which are just meant to be simple demos) to the train CLI.
Edit:
You can also skip the somewhat indirect use of the built-in BILUO converters and use what you had above:
doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
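Be aware that doc.char_span returns None when the offsets do not line up with token boundaries, which appears to be what produced the NoneType error in Edit 2 of the question. Here is a rough sketch, standard library only, for spotting suspicious offsets before converting. The helper name find_suspect_spans is hypothetical, and the alphanumeric check only approximates spaCy's tokenizer, so treat a clean result as a hint rather than a guarantee.

```python
def find_suspect_spans(train_data):
    """Return (text, start, end, label, reason) for offsets likely to fail."""
    problems = []
    for text, annot in train_data:
        for start, end, label in annot["entities"]:
            if not (0 <= start < end <= len(text)):
                problems.append((text, start, end, label, "out of range"))
                continue
            span = text[start:end]
            if span != span.strip():
                problems.append((text, start, end, label, "leading/trailing whitespace"))
            elif start > 0 and text[start - 1].isalnum() and text[start].isalnum():
                # The span begins in the middle of a run of letters/digits
                problems.append((text, start, end, label, "starts mid-word"))
            elif end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
                # The span stops in the middle of a run of letters/digits
                problems.append((text, start, end, label, "ends mid-word"))
    return problems


TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
    # Deliberately wrong example: the span cuts "London" short
    ("I like London.", {"entities": [(7, 12, "LOC")]}),
]

for problem in find_suspect_spans(TRAIN_DATA):
    print(problem)
```

Any example it reports is worth checking by hand with print(text[start:end]) before you build the docs.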
Upvotes: 7