Reputation: 39
I could really use some help creating training data for spaCy. I have tried many approaches. I started with a CSV of words and entities, converted it to a list of words and a list of entities, then grouped the words into lists of sentences and the tags into lists of tags per sentence. I then converted these to JSON. I now have multiple versions of JSON files that I want to convert to the new .spacy format. However, none of the training data seems to work after using --converter ner, because the converter does not find the NER format.
I first tried to convert the example to a JSON file:
    next_sentence = ""
    word_index_in_sentence = 0
    start_index = list()
    end_index = list()
    sent_tags = list()
    TRAIN_DATA = []
    with open("/content/drive/MyDrive/train_file.json", "w+", encoding="utf-8") as f:
        for word_index, word in enumerate(word_list):
            # Character offset where the current word starts within the sentence
            if word_index_in_sentence == 0:
                start_index.append(0)
            else:
                start_index.append((end_index[word_index_in_sentence-1])+1)
            sent_tags.append(tag_list[word_index])
            if word == "." or word == "?" or word == "!" or word_index == len(word_list)-1:
                # End of sentence: write out the sentence and its entities, then reset
                next_sentence += word
                end_index.append(start_index[word_index_in_sentence]+1)
                entities = ""
                for i in range(word_index_in_sentence):
                    if (i != 0):
                        entities += ","
                    entities += "(" + str(start_index[i]) + "," + str(end_index[i]) + "," + "'" + sent_tags[i] + "'" + ")"
                f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')
                next_sentence = ""
                word_index_in_sentence = 0
                start_index = list()
                end_index = list()
                sent_tags = list()
            else:
                # No space before following punctuation
                if word_list[word_index + 1] == "," or word_list[word_index + 1] == "." or word_list[word_index + 1] == "!" or word_list[word_index + 1] == "?":
                    next_sentence += word
                    end_index.append(start_index[word_index_in_sentence]+len(word)-1)
                else:
                    next_sentence += word + " "
                    end_index.append(start_index[word_index_in_sentence]+len(word))
                word_index_in_sentence += 1
Since this did not work as expected, I then tried to write a list of dicts of dicts. So instead of

    f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')

I created a list TRAIN_DATA and appended the values as dicts like this:

    TRAIN_DATA.append({next_sentence: {"entities":entities}})

and saved TRAIN_DATA to a JSON file again.
However, when I run

    python -m spacy convert --converter ner /path/to/file /path/to/save

it does convert the file to .spacy, but it reports:

    ⚠ Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert
    ⚠ No sentence boundaries found to use with option `-n 1`. Use `-s` to automatically segment sentences or `-n 0` to disable.
    ⚠ No sentence boundaries found. Use `-s` to automatically segment sentences.
    ⚠ No document delimiters found. Use `-n` to automatically group sentences into documents.
    ✔ Generated output file (1 documents): /content/drive/MyDrive/TRAIN_DATA/hope.spacy
After converting to JSON, my training data looks either like this:
[{"Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.": {"entities": "(0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')"}}, {"welt.de vom 29.10.2005 Firmengründer Wolf Peter Bree arbeitete Anfang der siebziger Jahre als Möbelvertreter, als er einen fliegenden Händler aus dem Libanon traf.": {"entities": "(0,22,'[2005-10-29]'),...
or like this:
[("Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.", {"entities": (0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')}),....
python -m spacy debug data /path/to/config
gives me the output:
    ⚠ The debug-data command is now available via the 'debug data' subcommand (without the hyphen). You can run python -m spacy debug --help for an overview of the other available debugging commands.

    ============================ Data file validation ============================
    ✔ Corpus is loadable
    ✔ Pipeline can be initialized with data

    =============================== Training stats ===============================
    Language: de
    Training pipeline: transformer, ner
    1 training docs
    1 evaluation docs
    ✔ No overlap between training and evaluation data
    ✘ Low number of examples to train a new pipeline (1)

    ============================== Vocab & Vectors ==============================
    ℹ 1 total word(s) in the data (1 unique)
    ℹ No word vectors present in the package

    ========================== Named Entity Recognition ==========================
    ℹ 1 label(s)
    0 missing value(s) (tokens with '-' label)
    ⚠ Low number of examples for label 'stamt",{"entities":[(0,51,"O"),(52,67,"B' (1)
    ⚠ No examples for texts WITHOUT new label 'stamt",{"entities":[(0,51,"O"),(52,67,"B'
    ✔ No entities consisting of or starting/ending with whitespace
    ✔ No entities consisting of or starting/ending with punctuation

    ================================== Summary ==================================
    ✔ 5 checks passed
    ⚠ 2 warnings
    ✘ 1 error
Can someone PLEASE help me convert my list of words and entities to spaCy's NER format so that I can train an NER model? I would appreciate it. Thank you!
Upvotes: 0
Views: 1385
Reputation: 15593
This is answered in Discussions, but in short: your data is not in NER format, nor is it in the JSON format used by the converter. It is in a format used for training data, saved to JSON.
The easiest thing to do in this case is probably to convert your data to columnar IOB data and run the converter on that.
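As a rough sketch of what that could look like: the snippet below takes the flat word_list and tag_list from the question and produces whitespace-separated columns, one token per line with the IOB tag in the last column and a blank line between sentences. This assumes the CoNLL-2003 style layout, that tokens contain no whitespace, and that every tag is a valid IOB tag (a tag like '[2005-10-29]' from the question would need to be mapped to something like B-DATE first). The function name to_conll_ner and the sample data are made up for illustration.

```python
def to_conll_ner(word_list, tag_list, sentence_end=(".", "?", "!")):
    """Render parallel token/tag lists as columnar IOB text.

    Hypothetical helper: one "token TAG" pair per line, with a blank
    line after each sentence-final punctuation token.
    """
    if len(word_list) != len(tag_list):
        raise ValueError("word_list and tag_list must have the same length")

    lines = []
    for word, tag in zip(word_list, tag_list):
        lines.append(f"{word} {tag}")
        if word in sentence_end:
            lines.append("")  # blank line marks a sentence boundary

    # Drop trailing blank lines so the text ends with exactly one newline
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines) + "\n"


# Made-up sample data in the same shape as the question's lists
sample_words = ["Schartau", "sagte", ".", "Fischer", "kam", "."]
sample_tags = ["B-PER", "O", "O", "B-PER", "O", "O"]
sample_text = to_conll_ner(sample_words, sample_tags)
print(sample_text)
```

Writing the returned string to a file with encoding="utf-8" should give you something you can pass to the same python -m spacy convert --converter ner command from the question. Because the sentences are separated by blank lines, the warnings about missing sentence boundaries should no longer apply.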
Upvotes: 1