darned7

Reputation: 39

NER: Defining Train Data for Spacy v3

I could really use some help creating training data for spaCy. I have tried many approaches: I started with a CSV of words and entities, converted it into lists of words and entities, grouped the words into lists of sentences and the tags into lists of tags per sentence, and then converted everything to JSON. I now have multiple versions of JSON files that I want to convert to the new .spacy format. However, no training data seems to work after using --converter ner, as the converter does not detect the NER format.

First, I tried to convert the data to a JSON file:

# word_list and tag_list are the aligned word/tag lists built from the CSV
next_sentence = ""
word_index_in_sentence = 0
start_index = list()
end_index = list()
sent_tags = list()
TRAIN_DATA = []
with open("/content/drive/MyDrive/train_file.json", "w+", encoding="utf-8") as f:
    for word_index, word in enumerate(word_list):
        if word_index_in_sentence == 0:
            start_index.append(0)
        else:
            start_index.append(end_index[word_index_in_sentence - 1] + 1)

        sent_tags.append(tag_list[word_index])

        if word in (".", "?", "!") or word_index == len(word_list) - 1:
            # sentence boundary: assemble the entity tuples for this sentence
            next_sentence += word
            end_index.append(start_index[word_index_in_sentence] + 1)
            entities = ""
            for i in range(word_index_in_sentence):
                if i != 0:
                    entities += ","
                entities += "(" + str(start_index[i]) + "," + str(end_index[i]) + ",'" + sent_tags[i] + "')"

            f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')
            next_sentence = ""
            word_index_in_sentence = 0
            start_index = list()
            end_index = list()
            sent_tags = list()
        else:
            if word_list[word_index + 1] in (",", ".", "!", "?"):
                next_sentence += word
                end_index.append(start_index[word_index_in_sentence] + len(word) - 1)
            else:
                next_sentence += word + " "
                end_index.append(start_index[word_index_in_sentence] + len(word))
            word_index_in_sentence += 1

Since this did not work as expected, I then tried writing a list of dicts of dicts. So instead of

f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')

I created a list TRAIN_DATA and appended the values as dicts like this:

TRAIN_DATA.append({next_sentence: {"entities":entities}})

and then saved TRAIN_DATA to a JSON file again.

However, when I run python -m spacy convert --converter ner /path/to/file /path/to/save, it does produce a .spacy file, but it reports:

⚠ Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert
⚠ No sentence boundaries found to use with option -n 1. Use -s to automatically segment sentences or -n 0 to disable.
⚠ No sentence boundaries found. Use -s to automatically segment sentences.
⚠ No document delimiters found. Use -n to automatically group sentences into documents.
✔ Generated output file (1 documents): /content/drive/MyDrive/TRAIN_DATA/hope.spacy

After converting to JSON, my training data looks either like this:

[{"Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.": {"entities": "(0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')"}}, {"welt.de vom 29.10.2005 Firmengründer Wolf Peter Bree arbeitete Anfang der siebziger Jahre als Möbelvertreter, als er einen fliegenden Händler aus dem Libanon traf.": {"entities": "(0,22,'[2005-10-29]'),...

or like this:

[("Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.", {"entities": (0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')}),....

python -m spacy debug data /path/to/config

gives me the output:

⚠ The debug-data command is now available via the 'debug data' subcommand (without the hyphen). You can run python -m spacy debug --help for an overview of the other available debugging commands.

============================ Data file validation ============================
✔ Corpus is loadable
✔ Pipeline can be initialized with data

=============================== Training stats ===============================
Language: de
Training pipeline: transformer, ner
1 training docs
1 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (1)

============================== Vocab & Vectors ==============================
ℹ 1 total word(s) in the data (1 unique)
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 1 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Low number of examples for label 'stamt",{"entities":[(0,51,"O"),(52,67,"B' (1)
⚠ No examples for texts WITHOUT new label 'stamt",{"entities":[(0,51,"O"),(52,67,"B'
✔ No entities consisting of or starting/ending with whitespace
✔ No entities consisting of or starting/ending with punctuation

================================== Summary ==================================
✔ 5 checks passed
⚠ 2 warnings
✘ 1 error

Can someone PLEASE help me convert my list of words and entities into spaCy's NER format so I can train a NER model? I would appreciate it. Thank you!

Upvotes: 0

Views: 1385

Answers (1)

polm23

Reputation: 15593

This is answered in Discussions, but in short: your data is not in NER format, nor is it in the JSON format used by the converter. It's in a format used for training data, saved to JSON.

The easiest thing to do in this case is probably to convert your data to columnar IOB data and run the converter on that.
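Since you already have aligned word and tag lists, producing IOB data is mostly a matter of joining tokens with their tags. A minimal sketch (the sample `word_list`/`tag_list` here are illustrative stand-ins for your data, not taken from it), writing one sentence per line with tokens as `word|tag`:

```python
# Sketch: turn aligned word/tag lists into columnar IOB data
# (one sentence per line, each token written as word|tag).
word_list = ["Schartau", "sagte", "dem", "Tagesspiegel", "."]
tag_list = ["B-PER", "O", "O", "B-ORG", "O"]

sentences, current = [], []
for word, tag in zip(word_list, tag_list):
    current.append(f"{word}|{tag}")
    if word in {".", "!", "?"}:   # crude sentence boundary, as in your loop
        sentences.append(" ".join(current))
        current = []
if current:                       # flush a trailing sentence without punctuation
    sentences.append(" ".join(current))

with open("train.iob", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")
```

Then the built-in converter can read that file directly, e.g. `python -m spacy convert --converter iob -s -n 10 train.iob ./corpus` (the `-n 10` grouping is just an example value).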

Upvotes: 1
