Reputation: 4180
I would like to do some evaluation of spaCy's pretrained models with the WikiNER datasets. However, these datasets are in JSON format, using the BILUO annotation scheme. I know that I can do evaluation in the command-line interface, but I would like to do it in the Python interpreter instead, which requires a different data format, as shown below.
TRAIN_DATA = [("Dogs are loyal", {'entities': [(0, 4, 'ANIMAL')]})]
I wonder if there is a way to convert the BILUO-scheme JSON-formatted data into the format above. Or, alternatively, is it possible to directly evaluate data that is in JSON format (e.g., I could read the JSON files into the Python interpreter)?
Thanks!
EDIT: Added sample json data set
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"orth":"Zum",
"tag":"-",
"ner":"O"
},
{
"orth":"1.",
"tag":"-",
"ner":"O"
},
{
"orth":"Januar",
"tag":"-",
"ner":"O"
},
{
"orth":"1994",
"tag":"-",
"ner":"O"
},
{
"orth":"wird",
"tag":"-",
"ner":"O"
},
{
"orth":"Ruppendorf",
"tag":"-",
"ner":"U-LOC"
},
{
"orth":"nach",
"tag":"-",
"ner":"O"
},
{
"orth":"H\u00f6ckendorf",
"tag":"-",
"ner":"U-LOC"
},
{
"orth":"eingemeindet",
"tag":"-",
"ner":"O"
},
{
"orth":".",
"tag":"-",
"ner":"O"
}
]
}
]
}
]
},
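For reference, the target conversion can be sketched in plain Python, without spaCy. This assumes tokens are joined with single spaces (which is how spaCy reconstructs the text when no raw text is provided); the function name here is mine, not a spaCy API:

```python
def biluo_to_offsets(tokens, tags):
    """Convert BILUO NER tags to (start, end, label) character offsets,
    counting offsets against the tokens joined with single spaces."""
    offsets = []
    pos = 0
    start = None
    for token, tag in zip(tokens, tags):
        if tag.startswith(("B-", "U-")):
            start = pos                      # entity begins at this token
        if tag.startswith(("L-", "U-")):
            offsets.append((start, pos + len(token), tag[2:]))
            start = None                     # entity ends at this token
        pos += len(token) + 1                # +1 for the joining space
    return offsets

tokens = ["Zum", "1.", "Januar", "1994", "wird", "Ruppendorf",
          "nach", "Höckendorf", "eingemeindet", "."]
tags = ["O", "O", "O", "O", "O", "U-LOC", "O", "U-LOC", "O", "O"]
print(biluo_to_offsets(tokens, tags))
# [(24, 34, 'LOC'), (40, 50, 'LOC')]
```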
Upvotes: 0
Views: 3253
Reputation: 43
For spaCy 3 several changes were made, so I would like to update the imports and code here. The input data is no longer in JSON format but in the binary .spacy format, so the task is to convert train.spacy to the desired format.
import spacy
from spacy.training import Corpus, biluo_tags_to_offsets

nlp = spacy.load("de_core_news_sm")
corpus = Corpus("path/to/train.spacy")
# Calling the corpus with a pipeline yields Example objects
train_data = corpus(nlp)
for example in train_data:
    print(example.text)
    print(biluo_tags_to_offsets(example.reference, example.get_aligned_ner()))
Here are some docs related to spaCy's Example, the BILUO helpers, and Corpus.
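If your data is still in the old v2 JSON training format, spaCy's convert CLI can produce the binary file first. This is a sketch; the file paths are placeholders:

```shell
# Convert the old JSON training format to the v3 binary .spacy format.
# spacy convert auto-detects the input format from the file extension.
python -m spacy convert ./train.json ./corpus/
```

This should write a .spacy file into ./corpus/, which can then be loaded with Corpus as shown in the code.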
Upvotes: 0
Reputation: 11474
An initial caveat: you are probably aware of this, but many of spacy's non-English NER models are trained on WikiNER, so be aware that you might accidentally be evaluating on the training data, which obviously won't give you a good picture of how well the model works.
If you have spacy's internal JSON training format with BILUO NER tags and you would like to have the entity spans referenced by character offsets, you can load the data with GoldCorpus and convert it to offsets with spacy.gold.offsets_from_biluo_tags. Note that with this kind of input, where no raw text is provided for each paragraph, you will have a space between each token when counting character offsets.
import spacy
from spacy.gold import GoldCorpus, offsets_from_biluo_tags

nlp = spacy.load('de')
goldcorpus = GoldCorpus("/path/to/train.json", "/path/to/train.json")
train_docs = goldcorpus.train_docs(nlp)
for doc, gold in train_docs:
    print(doc.text)
    print(offsets_from_biluo_tags(doc, gold.ner))
Output:
Zum 1. Januar 1994 wird Ruppendorf nach Höckendorf eingemeindet .
[(24, 34, 'LOC'), (40, 50, 'LOC')]
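The whitespace caveat above (no raw text, so the tokens are joined with single spaces) can be checked without spaCy, using plain Python over the sample tokens:

```python
tokens = ["Zum", "1.", "Januar", "1994", "wird", "Ruppendorf",
          "nach", "Höckendorf", "eingemeindet", "."]
text = " ".join(tokens)  # how the text is assembled when no raw text is given

# The reported offsets line up with this space-joined text:
print(text[24:34])  # Ruppendorf
print(text[40:50])  # Höckendorf

# The artifact: a space before the final period, unlike the original raw text.
print(text.endswith("eingemeindet ."))  # True
```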
Notes:
- GoldCorpus.train_docs() needs the nlp model in order to handle cases where the tokenization in your corpus and in spacy are not the same.
- GoldCorpus always expects both train and dev data to be provided, as GoldCorpus(train_path, dev_path), so loading the train data for both doesn't cause any problems as long as you're not using the dev data for anything.

Upvotes: 1