darned7

Reputation: 39

NER: Defining Train Data for Spacy v3

I could really use some help creating training data for spaCy. I have tried many approaches: I started with a CSV of words and entities, converted it into lists of words and entities, grouped the words into lists of sentences and the tags into lists of tags per sentence, and then converted everything to JSON. I now have multiple versions of JSON files that I want to convert to the new .spacy format. However, no training data seems to work after using --converter ner, as the converter does not detect the NER format.

First, I tried to convert the data to a JSON file:

# word_list and tag_list are the aligned word/tag lists built from the CSV
next_sentence = ""
word_index_in_sentence = 0
start_index = list()
end_index = list()
sent_tags = list()
TRAIN_DATA = []
with open("/content/drive/MyDrive/train_file.json", "w+", encoding="utf-8") as f:
    for word_index, word in enumerate(word_list):
        if word_index_in_sentence == 0:
            start_index.append(0)
        else:
            start_index.append(end_index[word_index_in_sentence - 1] + 1)

        sent_tags.append(tag_list[word_index])

        if word in (".", "?", "!") or word_index == len(word_list) - 1:
            # sentence boundary: assemble the entity tuples for this sentence
            next_sentence += word
            end_index.append(start_index[word_index_in_sentence] + 1)
            entities = ""
            for i in range(word_index_in_sentence):
                if i != 0:
                    entities += ","
                entities += "(" + str(start_index[i]) + "," + str(end_index[i]) + ",'" + sent_tags[i] + "')"

            f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')
            next_sentence = ""
            word_index_in_sentence = 0
            start_index = list()
            end_index = list()
            sent_tags = list()
        else:
            if word_list[word_index + 1] in (",", ".", "!", "?"):
                next_sentence += word
                end_index.append(start_index[word_index_in_sentence] + len(word) - 1)
            else:
                next_sentence += word + " "
                end_index.append(start_index[word_index_in_sentence] + len(word))
            word_index_in_sentence += 1

Since this did not work as expected, I then tried writing a list of dicts of dicts. So instead of

f.write('("' + next_sentence + '",{"entities": [' + entities + ']}),')

I created a list TRAIN_DATA and appended the values as dicts like this:

TRAIN_DATA.append({next_sentence: {"entities":entities}})

and then saved TRAIN_DATA to a JSON file again.

However, when I run python -m spacy convert --converter ner /path/to/file /path/to/save, it does produce a .spacy file, but it reports:

⚠ Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert
⚠ No sentence boundaries found to use with option -n 1. Use -s to automatically segment sentences or -n 0 to disable.
⚠ No sentence boundaries found. Use -s to automatically segment sentences.
⚠ No document delimiters found. Use -n to automatically group sentences into documents.
✔ Generated output file (1 documents): /content/drive/MyDrive/TRAIN_DATA/hope.spacy

After converting to JSON, my training data looks either like this:

[{"Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.": {"entities": "(0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')"}}, {"welt.de vom 29.10.2005 Firmengründer Wolf Peter Bree arbeitete Anfang der siebziger Jahre als Möbelvertreter, als er einen fliegenden Händler aus dem Libanon traf.": {"entities": "(0,22,'[2005-10-29]'),...

or like this:

[("Schartau sagte dem Tagesspiegel vom Freitag, Fischer sei in einer Weise aufgetreten, die alles andere als überzeugend war.", {"entities": (0,8,'B-PER'),(9,14,'O'),(15,18,'O'),(19,31,'B-ORG'),(32,35,'O'),(36,42,'O'),(43,44,'O'),(45,52,'B-PER'),(53,56,'O'),(57,59,'O'),(60,65,'O'),(66,71,'O'),(72,82,'O'),(83,84,'O'),(85,88,'O'),(89,94,'O'),(95,101,'O'),(102,105,'O'),(106,117,'O'),(118,120,'O')}),....

python -m spacy debug data /path/to/config

gives me the output:

⚠ The debug-data command is now available via the 'debug data' subcommand (without the hyphen). You can run python -m spacy debug --help for an overview of the other available debugging commands.

============================ Data file validation ============================
✔ Corpus is loadable
✔ Pipeline can be initialized with data

=============================== Training stats ===============================
Language: de
Training pipeline: transformer, ner
1 training docs
1 evaluation docs
✔ No overlap between training and evaluation data
✘ Low number of examples to train a new pipeline (1)

============================== Vocab & Vectors ==============================
ℹ 1 total word(s) in the data (1 unique)
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 1 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Low number of examples for label 'stamt",{"entities":[(0,51,"O"),(52,67,"B' (1)
⚠ No examples for texts WITHOUT new label 'stamt",{"entities":[(0,51,"O"),(52,67,"B'
✔ No entities consisting of or starting/ending with whitespace
✔ No entities consisting of or starting/ending with punctuation

================================== Summary ==================================
✔ 5 checks passed
⚠ 2 warnings
✘ 1 error

Can someone PLEASE help me convert my list of words and entities into spaCy's NER format so I can train a NER model? I would appreciate it. Thank you!

Upvotes: 0

Views: 1385

Answers (1)

polm23

Reputation: 15593

This is answered in Discussions, but in short: your data is not in NER format, nor is it in the JSON format used by the converter. It's in a format used for training data, saved to JSON.

The easiest thing to do in this case is probably to convert your data to columnar IOB data and run the converter on that.
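Since you already have aligned word and tag lists, producing IOB data is mostly a matter of joining tokens with their tags. A minimal sketch (the sample `word_list`/`tag_list` here are illustrative stand-ins for your data, not taken from it), writing one sentence per line with tokens as `word|tag`:

```python
# Sketch: turn aligned word/tag lists into columnar IOB data
# (one sentence per line, each token written as word|tag).
word_list = ["Schartau", "sagte", "dem", "Tagesspiegel", "."]
tag_list = ["B-PER", "O", "O", "B-ORG", "O"]

sentences, current = [], []
for word, tag in zip(word_list, tag_list):
    current.append(f"{word}|{tag}")
    if word in {".", "!", "?"}:   # crude sentence boundary, as in your loop
        sentences.append(" ".join(current))
        current = []
if current:                       # flush a trailing sentence without punctuation
    sentences.append(" ".join(current))

with open("train.iob", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")
```

Then the built-in converter can read that file directly, e.g. `python -m spacy convert --converter iob -s -n 10 train.iob ./corpus` (the `-n 10` grouping is just an example value).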

Upvotes: 1
