Reputation: 141
I'm using INCEpTION to annotate named entities, which I want to use to train a model with spaCy. INCEpTION offers several export formats for the annotated text (e.g. CoNLL 2000, CoNLL CoreNLP, CoNLL-U). I exported the file as CoNLL-U and want to convert it to JSON, since that is the format required to train spaCy's NER module. Someone has asked a similar question (here), but the answer doesn't help me.
This is the annotated test text that I am using
spaCy's convert script is:
python -m spacy convert [input_file] [output_dir] [--file-type] [--converter]
[--n-sents] [--morphology] [--lang]
My first problem is that I can't convert the file to .json. When I use the command below, I only get output without any named entities (see the last output):
!python -m spacy convert Teest.conllu
I also tried adding an output path and json:
!python -m spacy convert Teest.conllu C:\Users json
But then I get the following error:
usage: spacy convert [-h] [-t json] [-n 1] [-s] [-b None] [-m] [-c auto]
[-l None]
input_file [output_dir]
spacy convert: error: unrecognized arguments: Users json
My second problem is that the output contains neither named entities nor start and end indices:
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Hallo",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":1,
"orth":",",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":2,
"orth":"dies",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":3,
"orth":"ist",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":4,
"orth":"ein",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":5,
"orth":"Test",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":6,
"orth":"um",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":7,
"orth":"zu",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":8,
"orth":"schauen",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":9,
"orth":",",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":10,
"orth":"wie",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":11,
"orth":"in",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":12,
"orth":"Inception",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":13,
"orth":"annotiert",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":14,
"orth":"wird",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":15,
"orth":".",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
},
{
"id":1,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Funktioniert",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":1,
"orth":"es",
"tag":"_",
"head":0,
"dep":"_"
},
{
"id":2,
"orth":"?",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
},
{
"id":2,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Simon",
"tag":"_",
"head":0,
"dep":"_"
}
]
}
]
}
]
}
]
I am using spaCy version 2.3.0 and Python version 3.8.3.
UPDATE: I used a new file to find out whether the language was causing the issue. When I export the file as CoNLL CoreNLP, the file contains named entities:
1 I'm _ _ O _ _
2 trying _ _ Verb _ _
3 to _ _ O _ _
4 fix _ _ Verb _ _
5 some _ _ O _ _
6 problems _ _ O _ _
7 . _ _ O _ _
1 But _ _ O _ _
2 why _ _ O _ _
3 it _ _ O _ _
4 this _ _ O _ _
5 not _ _ O _ _
6 working _ _ Verb _ _
7 ? _ _ O _ _
1 Simon _ _ Name _ _
However, when I try to convert the CoNLL CoreNLP file with
!python -m spacy convert Teest.conll
the error
line 68, in read_conllx
id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts
ValueError: not enough values to unpack (expected 10, got 7)
shows up.
UPDATE: After adding three more tab-separated "_" columns before the NER column (so that each line has the 10 columns the converter expects), the conversion works. The output is:
[
{
"id":0,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"I'm",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":1,
"orth":"trying",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":2,
"orth":"to",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":3,
"orth":"fix",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":4,
"orth":"some",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":5,
"orth":"problems",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":6,
"orth":".",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
}
]
}
]
}
]
},
{
"id":1,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"But",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":1,
"orth":"why",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":2,
"orth":"it",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":3,
"orth":"this",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":4,
"orth":"not",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
},
{
"id":5,
"orth":"working",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-rb"
},
{
"id":6,
"orth":"?",
"tag":"_",
"head":0,
"dep":"_",
"ner":"O"
}
]
}
]
}
]
},
{
"id":2,
"paragraphs":[
{
"sentences":[
{
"tokens":[
{
"id":0,
"orth":"Simon",
"tag":"_",
"head":0,
"dep":"_",
"ner":"U-me"
}
]
}
]
}
]
}
]
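The truncated labels in this output (U-rb, U-me) look as if the first two characters of Verb and Name were consumed as an IOB prefix. Assuming that is the cause, giving the bare labels an explicit B-/I- prefix before conversion may avoid it. A minimal sketch; to_iob2 is a hypothetical helper, not part of spaCy or INCEpTION, and it assumes that adjacent tokens with the same label belong to one entity:

```python
def to_iob2(labels, outside="O"):
    """Turn bare per-token labels into IOB2 tags.

    The first token of an entity gets "B-<label>", following tokens
    with the same label get "I-<label>". Assumption: two neighbouring
    tokens with the same label form a single entity.
    """
    tags = []
    previous = outside
    for label in labels:
        if label in (outside, "_", ""):
            # Token is not part of any entity.
            tags.append(outside)
            previous = outside
        elif label == previous:
            # Same label as the previous token: continue the entity.
            tags.append("I-" + label)
        else:
            # New entity starts here.
            tags.append("B-" + label)
            previous = label
    return tags


# Labels taken from the first sentence of the export above.
print(to_iob2(["O", "Verb", "O", "Verb", "O", "O", "O"]))
```

The resulting tags would then replace the bare labels in the NER column before running the converter.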
Still, I can't convert this directly to a .json file and as far as I know, tuples are required to train spaCy's NER module. E.g.:
[('Berlin is a city.', {'entities': [(0, 6, 'LOC'), (7, 9, 'VERB'), (12, 16, 'NOUN')]})]
Upvotes: 5
Views: 2336
Reputation: 141
I have found a way to use INCEpTION as an annotation tool for training spaCy's NER module. I tried various file formats, but in my opinion it only works with CoNLL 2002 and spaCy's command-line interface:
python -m venv .venv
.venv\Scripts\activate.bat
pip install spacy
python -m spacy download en_core_web_lg
python -m spacy convert --converter ner file_name.conll [output directory]
This step shouldn't work since CoNLL 2002 uses IOB2 and spaCy's converter requires IOB. However, I didn't have any problems and the .json output file is correct.
Here is a pretty good example of how you can proceed with the converted file.
Upvotes: 1
Reputation: 401
I understand your pain.
You need to write a script yourself to convert your last output into the format spaCy expects.
A better solution would be to use spacy-annotator, which lets you annotate entities and produces output in a format that spaCy likes. Here is what it looks like:
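Such a script could look roughly like the sketch below. It assumes the converted JSON has the structure shown in the question (paragraphs, sentences, tokens with "orth" and a BILUO-style "ner" field) and that tokens can be joined with single spaces, since the original whitespace is not stored in that output. The function name json_to_train_data is made up for illustration.

```python
def json_to_train_data(docs):
    """Convert converted-JSON docs into (text, {"entities": [...]}) tuples.

    Assumptions: every token has "orth"; "ner" holds BILUO tags
    (missing means "O"); tokens are joined with one space each.
    """
    train_data = []
    for doc in docs:
        for paragraph in doc["paragraphs"]:
            for sentence in paragraph["sentences"]:
                words, entities = [], []
                offset = 0        # character position of the current token
                ent_start = None  # start offset of an open multi-token entity
                for token in sentence["tokens"]:
                    word = token["orth"]
                    end = offset + len(word)
                    prefix, _, label = token.get("ner", "O").partition("-")
                    if prefix == "U":
                        # Single-token entity.
                        entities.append((offset, end, label))
                    elif prefix == "B":
                        # First token of a multi-token entity.
                        ent_start = offset
                    elif prefix == "L" and ent_start is not None:
                        # Last token closes the open entity.
                        entities.append((ent_start, end, label))
                        ent_start = None
                    words.append(word)
                    offset = end + 1  # +1 for the joining space
                train_data.append((" ".join(words), {"entities": entities}))
    return train_data


# Hand-written example in the same structure as the converter output.
example = [{"id": 0, "paragraphs": [{"sentences": [{"tokens": [
    {"orth": "Simon", "ner": "U-PER"},
    {"orth": "lives", "ner": "O"},
    {"orth": "in", "ner": "O"},
    {"orth": "New", "ner": "B-LOC"},
    {"orth": "York", "ner": "L-LOC"},
]}]}]}]
print(json_to_train_data(example))
```

Because the text is rebuilt from tokens, the character offsets refer to the space-joined string and not to the original document, so punctuation will be preceded by a space.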
Upvotes: 0