nomisjon

Reputation: 141

How to convert text annotated in INCEpTION into NER training data for spaCy? (CoNLL-U to JSON)

I'm using INCEpTION to annotate named entities which I want to use to train a model with spaCy. There are several options in INCEpTION to export the annotated text (e.g. CoNLL 2000, CoNLL CoreNLP, CoNLL-U). I have exported the file as CoNLL-U and want to convert it to JSON, since this format is required to train spaCy's NER module. Someone has asked a similar question, but the answer doesn't help me (here).

This is the annotated test text that I am using

spaCy's convert script is:

python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
[--n-sents] [--morphology] [--lang]

My first problem is that I can't convert the file to .json. When I use the command below, I only get an output without any named entities (see the last output):

!python -m spacy convert Teest.conllu

I also tried to add an output path and the json file type

!python -m spacy convert Teest.conllu C:\Users json

But then I get the following error:

usage: spacy convert [-h] [-t json] [-n 1] [-s] [-b None] [-m] [-c auto]
                     [-l None]
                     input_file [output_dir]
spacy convert: error: unrecognized arguments: Users json

My second problem is that the output does not contain any named entities, nor start and end indices:

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Hallo",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":1,
                "orth":",",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":2,
                "orth":"dies",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":3,
                "orth":"ist",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":4,
                "orth":"ein",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":5,
                "orth":"Test",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":6,
                "orth":"um",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":7,
                "orth":"zu",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":8,
                "orth":"schauen",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":9,
                "orth":",",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":10,
                "orth":"wie",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":11,
                "orth":"in",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":12,
                "orth":"Inception",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":13,
                "orth":"annotiert",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":14,
                "orth":"wird",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":15,
                "orth":".",
                "tag":"_",
                "head":0,
                "dep":"_"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Funktioniert",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":1,
                "orth":"es",
                "tag":"_",
                "head":0,
                "dep":"_"
              },
              {
                "id":2,
                "orth":"?",
                "tag":"_",
                "head":0,
                "dep":"_"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":2,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Simon",
                "tag":"_",
                "head":0,
                "dep":"_"
              }
            ]
          }
        ]
      }
    ]
  }
]

I am using spaCy version 2.3.0 and Python version 3.8.3.

UPDATE: I have used a new file since I wanted to find out whether there are any issues with the language. When I export the file as CoNLL CoreNLP, the file contains named entities:

1   I'm         _   _   O       _   _
2   trying      _   _   Verb    _   _
3   to          _   _   O       _   _
4   fix         _   _   Verb    _   _
5   some        _   _   O       _   _
6   problems    _   _   O       _   _
7   .           _   _   O       _   _
    
1   But         _   _   O       _   _
2   why         _   _   O       _   _
3   it          _   _   O       _   _
4   this        _   _   O       _   _
5   not         _   _   O       _   _
6   working     _   _   Verb    _   _
7   ?           _   _   O       _   _

1   Simon       _   _   Name    _   _

However, when I try to convert the CoNLL CoreNLP file with

!python -m spacy convert Teest.conll

the error

line 68, in read_conllx
    id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: not enough values to unpack (expected 10, got 7)

shows up.

UPDATE: By adding 3 more tab-separated "_" columns before the NER column, the conversion works. The output is:

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"I'm",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":1,
                "orth":"trying",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-rb"
              },
              {
                "id":2,
                "orth":"to",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":3,
                "orth":"fix",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-rb"
              },
              {
                "id":4,
                "orth":"some",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":5,
                "orth":"problems",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":6,
                "orth":".",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"But",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":1,
                "orth":"why",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":2,
                "orth":"it",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":3,
                "orth":"this",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":4,
                "orth":"not",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              },
              {
                "id":5,
                "orth":"working",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-rb"
              },
              {
                "id":6,
                "orth":"?",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":2,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"Simon",
                "tag":"_",
                "head":0,
                "dep":"_",
                "ner":"U-me"
              }
            ]
          }
        ]
      }
    ]
  }
]

Still, I can't convert this directly to a .json file, and as far as I know, tuples are required to train spaCy's NER module, e.g.:

[('Berlin is a city.', {'entities': [(0, 6, 'LOC'), (7, 9, 'VERB'), (12, 16, 'NOUN')]})]
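
For context, this is roughly how I understand such tuples being consumed in a spaCy v2 training loop (just a sketch with placeholder data and labels, not code I have running):

import random
import spacy

# placeholder training data in the tuple format shown above
TRAIN_DATA = [
    ("Berlin is a city.", {"entities": [(0, 6, "LOC")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

# register every label that occurs in the data
for _, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for itn in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(itn, losses)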

Upvotes: 5

Views: 2336

Answers (2)

nomisjon

Reputation: 141

I have found a way to use INCEpTION as an annotation tool and train spaCy's NER module on the result. I have tried various file formats, but as far as I can tell it only works with CoNLL 2002 and spaCy's command line interface.

  1. Annotate the text in INCEpTION
  2. Export the annotated text as a CoNLL 2002 file
  3. Set up the command line interface (Windows is used here)
python -m venv .venv
.venv\Scripts\activate.bat
  4. Install spaCy if necessary and download the required language model (I'm using the large English model)
pip install spacy
python -m spacy download en_core_web_lg
  5. Convert the CoNLL 2002 file to spaCy's required input format
python -m spacy convert --converter ner file_name.conll [output directory]

This step shouldn't work since CoNLL 2002 uses IOB2 and spaCy's converter requires IOB. However, I didn't have any problems and the .json output file is correct.
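
To illustrate the difference with a made-up two-token entity (my own example, not from the spaCy docs):

Tokens:  New    York   visits Paris
IOB2:    B-LOC  I-LOC  O      B-LOC
IOB:     I-LOC  I-LOC  O      I-LOC

In IOB, B- is only used when an entity directly follows another entity of the same type; otherwise entities start with I-.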

  6. Debug-data tool, training and evaluation

Here is a pretty good example of how you can proceed with the converted file.
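
For reference, the v2 CLI calls for this last step look roughly like this (language code, file names and output directory are placeholders):

python -m spacy debug-data en train.json dev.json --pipeline ner
python -m spacy train en ./model train.json dev.json --pipeline ner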

Upvotes: 1

iEriii

Reputation: 401

I understand your pain.
You need to manually write a script to convert your last output into the format spaCy expects; a rough sketch of such a script is below.
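
This is only a sketch, assuming the structure of your last output (tokens joined by single spaces, BILUO tags in the "ner" field; the file name is a placeholder):

import json

def json_to_tuples(path):
    # turn spacy convert's token-level JSON into
    # (text, {"entities": [(start, end, label)]}) tuples
    train_data = []
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    for doc in docs:
        for para in doc["paragraphs"]:
            for sent in para["sentences"]:
                words, ents = [], []
                offset, start = 0, None
                for tok in sent["tokens"]:
                    orth, ner = tok["orth"], tok.get("ner", "O")
                    begin, end = offset, offset + len(orth)
                    if ner.startswith(("B-", "U-")):   # entity starts here
                        start = begin
                    if ner.startswith(("L-", "U-")):   # entity ends here
                        ents.append((start, end, ner.split("-", 1)[1]))
                    words.append(orth)
                    offset = end + 1                   # +1 for the joining space
                train_data.append((" ".join(words), {"entities": ents}))
    return train_data

print(json_to_tuples("Teest.json"))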

A better solution would be to use the spacy-annotator, which allows you to annotate entities and get an output in a format that spaCy likes. Here is what it looks like:

[screenshot of the spacy-annotator widget]

Upvotes: 0
