MehmedB

Reputation: 1137

How to train your model in deeppavlov (NER) Python 3

First of all, sorry for any newbie mistakes. I couldn't figure this out, and I couldn't find a source that covers the deeppavlov NER library specifically. I'm trying to train ner_ontonotes_bert_mult as described here. I assume it can be trained further from its checkpoint so that it recognizes specific patterns such as:

"Round 23/22; 24,9 x 12,2 x 12,3"

as

[[['Round', '23/22', ';', '24,9 x 12,2 x 12,3']], [['B-PRODUCT', 'I-PRODUCT', 'O', 'B-QUANTITY']]]

My questions are (before I dig into details):

  1. Is it possible?
  2. Where can I find more info about it specifically related to deeppavlov's model(s)?
  3. How can I train pre-trained deeppavlov model to recognize my custom patterns?

I'm not even sure it is possible, but I decided to give it a go and prepared three .txt files, "train.txt", "test.txt" and "validation.txt", as described on the deeppavlov web page. I put them in the folder '~/.deeppavlov/downloads/ontonotes/ner_ontonotes_bert_mult'. My dataset looks like this:

Round B-PRODUCT
23/22 I-PRODUCT
24,9 x 12,2 x 12,3 B-QUANTITY
Ring B-PRODUCT
HDFAA I-PRODUCT
12,7 x 10 B-QUANTITY

and so on... This is the code I am using to train it:

import os

# Force TensorFlow to use the CPU instead of the GPU.
# This has to be set before deeppavlov (and with it TensorFlow) is imported.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

# Load the config and print where the dataset reader looks for the data files.
config_dict = parse_config(configs.ner.ner_ontonotes_bert_mult)
print(config_dict['dataset_reader']['data_path'])

# Train with the stock config.
ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)

But I am getting this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
     [[{{node save/Assign_280}}]]

Full traceback:

2019-09-26 15:50:27.63 ERROR in 'deeppavlov.core.common.params'['params'] at line 110: Exception in <class 'deeppavlov.models.bert.bert_ner.BertNerModel'>
Traceback (most recent call last):
  File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/custom_user/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [3] rhs shape= [37]
     [[{{node save/Assign_280}}]]

UPDATE 2:

And I realized I can't use samples like " Round 23/22; 24,9 x 12,2 x 12,3 ". I need them to be in full sentences.

UPDATE:

It seems like this is happening because of my dataset. My custom dataset has only 3 tags (B-PRODUCT, I-PRODUCT and B-QUANTITY), but the pre-trained model has 37: 18 entity types, each with a B- and an I- variant (36 tags), plus the O tag ("O" means the absence of an entity). All available tags can be found here, under the sentence "The list of available tags and their descriptions are presented below.". Apparently all 37 tags need to be present in the dataset. I was able to get past that error by adding dummy sentences tagged with the missing tags. This is a terrible workaround, since I'm deliberately corrupting my own dataset. I'm still looking for a 'logical' way to train...
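
A quick way to check which tags a data file really contains is to count the last column of every line. This is only a sketch: it assumes one token and one tag per line with the tag as the last whitespace-separated column, and `collect_tags` is a made-up helper name, not part of deeppavlov.

```python
from collections import Counter
from pathlib import Path


def collect_tags(path):
    """Count the tags in a CoNLL-style file (assumed: tag is the last column)."""
    counts = Counter()
    text = Path(path).expanduser().read_text(encoding="utf8")
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            # Empty line (sentence boundary) or a line without a tag column.
            continue
        counts[parts[-1]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_tags.py train.txt valid.txt test.txt
    for file_name in sys.argv[1:]:
        tags = collect_tags(file_name)
        print(f"{file_name}: {len(tags)} distinct tags")
        for tag, count in sorted(tags.items()):
            print(f"  {tag}: {count}")
```

If the number of distinct tags differs from the 37 the checkpoint was trained with, that would be consistent with the shape mismatch reported in the error.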

PS: Now I am getting this error.

Traceback (most recent call last):
  File "/home/custom_user/.PyCharm2019.2/config/scratches/scratch_9.py", line 13, in <module>
    ner_model = train_model(configs.ner.ner_ontonotes_bert_mult)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/__init__.py", line 31, in train_model
    train_evaluate_model_from_config(config, download=download, recursive=recursive)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/commands/train.py", line 121, in train_evaluate_model_from_config
    trainer.train(iterator)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 294, in train
    self.train_on_batches(iterator)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 234, in train_on_batches
    self._validate(iterator)
  File "/home/custom_user/.local/lib/python3.6/site-packages/deeppavlov/core/trainers/nn_trainer.py", line 150, in _validate
    metrics = list(report['metrics'].items())
AttributeError: 'NoneType' object has no attribute 'items'

Upvotes: 3

Views: 2575

Answers (2)

Dinesh

Reputation: 151

I tried deeppavlov training and successfully trained the 'ner' model.

I got the same error at first while training, and got past it after researching it further.

Things to know before training:

-> The deeppavlov docs link to the 'ner_ontonotes_bert_mult.json' config file, which specifies the dataset path, the pretrained model path, the dataset_reader and the chainer pipe used for training.

-> The pretrained model is stored in the directory given in the config. By default the root directory is 'C:/users/{user_name}/.deeppavlov/', and pretrained models are stored in its 'models' subdirectory.

-> When you start training, the existing pretrained model is loaded and modified, which means training only tries to improve the pretrained model.

So, to train and build your own model from scratch, simply delete the 'models' subdirectory from the '.deeppavlov' path and run the training.

Upvotes: 1

Aleksei Lymar

Reputation: 66

There are at least two problems here:
1. instead of validation.txt there should be a valid.txt file;
2. you are trying to retrain a model that was pretrained on a different dataset with a different set of tags, which is not necessary.
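
For the first point, a small check along these lines could rename the file and report anything still missing. It is a sketch, not deeppavlov code: `fix_split_names` is a hypothetical helper, and the expected names are simply the three file names mentioned in this answer.

```python
from pathlib import Path

# File names the data directory is expected to contain.
EXPECTED = ("train.txt", "valid.txt", "test.txt")


def fix_split_names(data_dir):
    """Rename validation.txt to valid.txt and return the names still missing."""
    data_dir = Path(data_dir).expanduser()
    wrong = data_dir / "validation.txt"
    right = data_dir / "valid.txt"
    # Only rename when it cannot overwrite an existing valid.txt.
    if wrong.exists() and not right.exists():
        wrong.rename(right)
    return [name for name in EXPECTED if not (data_dir / name).exists()]
```

An empty list as the return value means all three files are in place.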

To train your model from scratch you can do something like:

import json
from deeppavlov import configs, build_model, train_model

with configs.ner.ner_ontonotes_bert_mult.open(encoding='utf8') as f:
    ner_config = json.load(f)

ner_config['dataset_reader']['data_path'] = '~/my_data_dir/'  # directory with train.txt, valid.txt and test.txt files
ner_config['metadata']['variables']['NER_PATH'] = '~/where_to_save_the_model/'
ner_config['metadata']['download'] = [ner_config['metadata']['download'][-1]]  # do not download the pretrained ontonotes model

ner_model = train_model(ner_config, download=True)
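
The files in the data directory could be produced with a helper like the one below. It is a sketch under assumptions that go beyond this answer: one "token tag" pair per line, an empty line between sentences, and a reader that splits each line on whitespace (so a token with spaces in it would not survive the round trip). `write_conll` is a made-up name, not part of deeppavlov.

```python
from pathlib import Path


def write_conll(sentences, path):
    """Write (tokens, tags) pairs as one "token tag" line per token.

    Assumption: sentences are separated by an empty line and the reader
    splits lines on whitespace, so tokens must not contain spaces.
    """
    lines = []
    for tokens, tags in sentences:
        if len(tokens) != len(tags):
            raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags: {tokens}")
        for token, tag in zip(tokens, tags):
            if not token or any(ch.isspace() for ch in token):
                raise ValueError(f"token is empty or contains whitespace: {token!r}")
            lines.append(f"{token} {tag}")
        lines.append("")  # empty line marks the end of a sentence
    Path(path).expanduser().write_text("\n".join(lines) + "\n", encoding="utf8")
```

Raising an error on a token such as '24,9 x 12,2 x 12,3' is a deliberate choice here: silently writing it would put extra columns on that line.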



Another thing that could go wrong is tokenization: "Round 23/22; 24,9 x 12,2 x 12,3" will be split by the model into ['Round', '23', '/', '22', ';', '24', ',', '9', 'x', '12', ',', '2', 'x', '12', ',', '3'] and not ['Round', '23/22', ';', '24,9 x 12,2 x 12,3'].

But you can tokenize your texts beforehand:

ner_model([['Round', '23/22', ';', '24,9 x 12,2 x 12,3']])
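
One way to get such tokens automatically is a small regex tokenizer. The following is only a sketch: the pattern is an assumption about what your product strings look like (decimal commas, "a x b x c" dimensions, "23/22" style codes), and `tokenize` is a made-up helper, not a deeppavlov function.

```python
import re

# A number with an optional decimal part, e.g. "24" or "24,9" or "12.3".
NUM = r"\d+(?:[.,]\d+)?"

TOKEN_RE = re.compile(
    rf"{NUM}(?:\s*x\s*{NUM})+"  # dimensions such as "24,9 x 12,2 x 12,3"
    r"|\d+/\d+"                 # codes such as "23/22"
    rf"|{NUM}"                  # plain numbers
    r"|\w+"                     # words
    r"|[^\w\s]"                 # any other single non-space character
)


def tokenize(text):
    """Split text into tokens, keeping dimension strings and codes whole."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Round 23/22; 24,9 x 12,2 x 12,3"))
    # ['Round', '23/22', ';', '24,9 x 12,2 x 12,3']
```

The result can then be passed to the model as a batch of one sentence, e.g. ner_model([tokenize(text)]). The order of the alternatives matters: the dimension pattern has to come before the plain number pattern, otherwise "24,9" would be matched on its own.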

Upvotes: 4
