spaCy: Is it possible to convert json format (with BILUO scheme) files to list format that is used for training in Python?

Question

I would like to do some evaluation of spaCy's pretrained models with the wikiner datasets. However, these datasets are in json format, using the BILUO annotation scheme. I know that I can do evaluation in the command-line interface, but I would like to do it in the Python interpreter instead, which requires a different data format, as shown below.

TRAIN_DATA = [("Dogs are loyal", {'entities': [(0, 4, 'ANIMAL')]})]

Is there a way to convert the BILUO-scheme JSON data into the format shown above? Alternatively, is it possible to evaluate data in JSON format directly (e.g., by reading the JSON files into the Python interpreter)?

Thanks!

EDIT: Added sample json data set

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Zum",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"1.",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"Januar",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"1994",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"wird",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"Ruppendorf",
                "tag":"-",
                "ner":"U-LOC"
              },
              {
                "orth":"nach",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"H\u00f6ckendorf",
                "tag":"-",
                "ner":"U-LOC"
              },
              {
                "orth":"eingemeindet",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":".",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },

aab · Accepted Answer

An initial caveat: you are probably aware of this, but many of spaCy's non-English NER models are trained on WikiNER, so you might accidentally be evaluating on the training data, which won't give you a good picture of how well the model works.

If you have spaCy's internal JSON training format with BILUO NER tags and you would like entity spans referenced by character offsets, you can load the data with GoldCorpus and convert it to offsets with spacy.gold.offsets_from_biluo_tags. Note that this kind of input provides no raw text for each paragraph, so the text is rebuilt with a single space between tokens and the character offsets are counted against that rebuilt text.

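# spaCy v2.x API: the spacy.gold module was removed in spaCy v3.
import spacy
from spacy.gold import GoldCorpus, offsets_from_biluo_tags

nlp = spacy.load('de')
# GoldCorpus takes (train_path, dev_path); the same file is passed for both here.
goldcorpus = GoldCorpus("/path/to/train.json", "/path/to/train.json")

train_docs = goldcorpus.train_docs(nlp)
for doc, gold in train_docs:
    print(doc.text)
    # gold.ner holds the BILUO tags; convert them to (start, end, label) offsets.
    print(offsets_from_biluo_tags(doc, gold.ner))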

Output:

Zum 1. Januar 1994 wird Ruppendorf nach Höckendorf eingemeindet .
[(24, 34, 'LOC'), (40, 50, 'LOC')]

Notes:

GoldCorpus.train_docs() needs the nlp model in order to handle cases where the tokenization in your corpus differs from spaCy's.
GoldCorpus always expects both train and dev data, provided as GoldCorpus(train_path, dev_path), so loading the train data for both doesn't cause any problems as long as you're not using the dev data for anything.
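
For comparison, the same conversion can be sketched in plain Python without spaCy. This is only a minimal sketch, assuming the JSON layout from the question's sample and a single space between tokens (as in the output above). The function name biluo_json_to_train_data is hypothetical and not part of spaCy; the sketch emits one example per sentence and does not validate malformed tag sequences.

```python
def biluo_json_to_train_data(docs):
    """Convert BILUO-tagged JSON docs to (text, {'entities': [...]}) tuples.

    Assumption: no raw text is available, so tokens are joined with single
    spaces and offsets are counted against that joined text.
    """
    train_data = []
    for doc in docs:
        for paragraph in doc["paragraphs"]:
            for sentence in paragraph["sentences"]:
                tokens = sentence["tokens"]
                entities = []
                pos = 0       # character offset of the current token
                start = None  # start offset of an open multi-token entity
                for token in tokens:
                    end = pos + len(token["orth"])
                    prefix, _, label = token.get("ner", "O").partition("-")
                    if prefix == "U":                          # single-token entity
                        entities.append((pos, end, label))
                    elif prefix == "B":                        # entity begins
                        start = pos
                    elif prefix == "L" and start is not None:  # entity ends
                        entities.append((start, end, label))
                        start = None
                    elif prefix == "O":                        # outside any entity
                        start = None
                    pos = end + 1  # skip the single joining space
                text = " ".join(token["orth"] for token in tokens)
                train_data.append((text, {"entities": entities}))
    return train_data


# The sentence from the question, rebuilt in the same JSON layout.
words = ["Zum", "1.", "Januar", "1994", "wird", "Ruppendorf",
         "nach", "Höckendorf", "eingemeindet", "."]
tags = ["O", "O", "O", "O", "O", "U-LOC", "O", "U-LOC", "O", "O"]
sample = [{"id": 0, "paragraphs": [{"sentences": [{"tokens": [
    {"orth": w, "tag": "-", "ner": t} for w, t in zip(words, tags)
]}]}]}]

result = biluo_json_to_train_data(sample)
print(result)
```

For a real file, the docs argument would be the list obtained by parsing the JSON file with json.load. Unlike GoldCorpus, this sketch does no alignment with a model's tokenization, so the offsets are only valid for the space-joined text it returns.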
