Alex

Reputation: 4180

spaCy: Is it possible to convert json format (with BILUO scheme) files to list format that is used for training in Python?

I would like to do some evaluation of spaCy's pretrained models with the wikiner datasets. However, these datasets are in json format, using the BILUO annotation scheme. I know that I can do evaluation in the command-line interface, but I would like to do it in the Python interpreter instead, which requires a different data format, as shown below.

TRAIN_DATA = [("Dogs are loyal", {'entities': [(0, 4, 'ANIMAL')]})]

I wonder if there is a way to convert the BILUO-scheme json data into the format above. Alternatively, is it possible to evaluate directly on data that is in json format (for example, by reading the json files into the Python interpreter)?

Thanks!

EDIT: Added sample json data set

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Zum",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"1.",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"Januar",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"1994",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"wird",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"Ruppendorf",
                "tag":"-",
                "ner":"U-LOC"
              },
              {
                "orth":"nach",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":"H\u00f6ckendorf",
                "tag":"-",
                "ner":"U-LOC"
              },
              {
                "orth":"eingemeindet",
                "tag":"-",
                "ner":"O"
              },
              {
                "orth":".",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  }
]

Upvotes: 0

Views: 3253

Answers (2)

Xia

Reputation: 43

spaCy 3 changed several things, so here are the updated imports and code. The training data is no longer in json format but in spaCy's binary format, so the task becomes converting train.spacy to the desired format.

import spacy
from spacy.training import Corpus, biluo_tags_to_offsets


nlp = spacy.load("de_core_news_sm")

corpus = Corpus("path/to/train.spacy")

train_data = corpus(nlp)

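for example in train_data:
    print(example.text)
    # get_aligned_ner() returns BILUO tags aligned to the predicted doc's tokens,
    # so the offsets are computed against example.predicted
    print(biluo_tags_to_offsets(example.predicted, example.get_aligned_ner()))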

Here are some docs related to spaCy's Example, the BILUO utilities, and Corpus.

Upvotes: 0

aab

Reputation: 11474

An initial caveat: you are probably aware of this, but many of spacy's non-English NER models are trained on WikiNER, so be aware that you might accidentally be evaluating on the training data, which obviously won't give you a good picture of how well the model works.

If you have spaCy's internal JSON training format with BILUO NER tags and you would like entity spans referenced by character offsets, you can load the data with GoldCorpus and convert it to offsets with spacy.gold.offsets_from_biluo_tags. Note that because this kind of input provides no raw text for each paragraph, the character offsets are counted as if a single space separated each pair of tokens.

import spacy
from spacy.gold import GoldCorpus, offsets_from_biluo_tags

nlp = spacy.load('de')
goldcorpus = GoldCorpus("/path/to/train.json", "/path/to/train.json")

train_docs = goldcorpus.train_docs(nlp)
for doc, gold in train_docs:
    print(doc.text)
    print(offsets_from_biluo_tags(doc, gold.ner))

Output:

Zum 1. Januar 1994 wird Ruppendorf nach Höckendorf eingemeindet .
[(24, 34, 'LOC'), (40, 50, 'LOC')]

Notes:

  • GoldCorpus.train_docs() needs the nlp model in order to handle cases where the tokenization in your corpus differs from spaCy's.
  • GoldCorpus always expects both train and dev data, as in GoldCorpus(train_path, dev_path), so passing the train data for both doesn't cause any problems as long as you're not using the dev data for anything.
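
If you would rather not go through GoldCorpus at all, the same conversion can be sketched in plain Python. This is only a rough sketch under a few assumptions: the input has the docs/paragraphs/sentences/tokens layout shown in the question, tokens are joined with a single space (as in the output above), and each sentence becomes one training tuple. The function name biluo_json_to_offsets is made up for illustration and is not part of spaCy.

```python
def biluo_json_to_offsets(docs):
    """Hypothetical helper: turn BILUO-tagged JSON docs into
    (text, {"entities": [(start, end, label), ...]}) tuples, one per sentence."""
    train_data = []
    for doc in docs:
        for paragraph in doc["paragraphs"]:
            for sentence in paragraph["sentences"]:
                words, entities = [], []
                pos = 0        # character offset of the current token
                start = None   # start offset of an open multi-token entity
                for token in sentence["tokens"]:
                    orth, tag = token["orth"], token["ner"]
                    end = pos + len(orth)
                    prefix, _, label = tag.partition("-")
                    if prefix == "U":            # single-token entity
                        entities.append((pos, end, label))
                    elif prefix == "B":          # entity begins here
                        start = pos
                    elif prefix == "L" and start is not None:  # entity ends here
                        entities.append((start, end, label))
                        start = None
                    # "I" and "O" tags need no action
                    words.append(orth)
                    pos = end + 1                # assume one space between tokens
                train_data.append((" ".join(words), {"entities": entities}))
    return train_data


# The sentence from the question, written as (orth, ner) pairs for brevity
pairs = [("Zum", "O"), ("1.", "O"), ("Januar", "O"), ("1994", "O"), ("wird", "O"),
         ("Ruppendorf", "U-LOC"), ("nach", "O"), ("H\u00f6ckendorf", "U-LOC"),
         ("eingemeindet", "O"), (".", "O")]
sample = [{"id": 0, "paragraphs": [{"sentences": [
    {"tokens": [{"orth": o, "tag": "-", "ner": n} for o, n in pairs]}]}]}]

train_data = biluo_json_to_offsets(sample)
print(train_data)
```

For real files you would pass in the result of json.load() instead of the hand-built sample. Unlike GoldCorpus, this does not realign anything with spaCy's own tokenizer, so the offsets are only meaningful for the space-joined text it returns.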

Upvotes: 1
