I have created a spaCy transformer model for named entity recognition. Last time I trained it until it reached 90% accuracy, and I have a model-best directory from which I can load the trained model for predictions. Now I have some more data samples and I want to resume training this spaCy transformer pipeline. I saw that this can be done by changing config.cfg, but I don't know what to change.
This is my config.cfg after running python -m spacy init fill-config ./base_config.cfg ./config.cfg:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null
[system]
gpu_allocator = "pytorch"
seed = 0
[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"
[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
mixed_precision = false
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
[components.transformer.model.grad_scaler_config]
[components.transformer.model.tokenizer_config]
use_fast = true
[components.transformer.model.transformer_config]
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005
[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
[pretraining]
[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
[initialize.tokenizer]
As you can see, there is a 'vectors' parameter under [initialize], so I tried pointing it at './model-best/ner'.
But it gave me this error:
OSError: [E884] The pipeline could not be initialized because the vectors could not be found at './model-best/ner'. If your pipeline was already initialized/trained before, call 'resume_training' instead of 'initialize', or initialize only the components that are new.
In case you are wondering whether I gave the wrong path: no, that directory exists.
So, please guide me on how I can successfully resume training from the previous weights.
Thank you!
Upvotes: 5
Views: 2628
Reputation: 146
The vectors setting refers to word vectors here. To reuse the vocabulary and tokenizer from the previously trained spaCy pipeline, you can use the following config:
[components.ner]
source = "/path/to/model-best"
[initialize]
vectors = ${paths.vectors}
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "/path/to/model-best"
vocab = "/path/to/model-best"
Upvotes: 0
Reputation: 11484
The vectors setting is not related to the transformer or to what you're trying to do.
In the new config, you want to use the source option to load the components from the existing pipeline. You would modify the [components.*] blocks to contain only the source setting and no other settings:
[components.ner]
source = "/path/to/model-best"
[components.transformer]
source = "/path/to/model-best"
See: https://spacy.io/usage/training#config-components
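If you would rather not edit the filled config by hand, the change above can be scripted. The following is only a minimal sketch using the Python standard library: source_components is a hypothetical helper, not part of spaCy, and it assumes that the sub-blocks of a sourced component (such as [components.ner.model]) should be dropped too, since the block is meant to contain no other settings.

```python
import re

# Matches a config section header such as "[components.ner.model]".
SECTION_RE = re.compile(r"^\[([^\]]+)\]\s*$")


def source_components(config_text, model_path, names):
    """Hypothetical helper (not a spaCy API): rewrite each
    [components.<name>] block so it only contains `source = "<model_path>"`,
    and drop that component's sub-blocks."""
    out = []
    skipping = False
    for line in config_text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            parts = match.group(1).split(".")
            is_target = (
                len(parts) >= 2 and parts[0] == "components" and parts[1] in names
            )
            if is_target and len(parts) == 2:
                # Keep the header, replace the body with the source setting.
                out.append(line)
                out.append(f'source = "{model_path}"')
                out.append("")
                skipping = True
                continue
            # Sub-blocks of a sourced component are skipped entirely;
            # any other section ends the skipping.
            skipping = is_target
            if skipping:
                continue
        elif skipping:
            continue
        out.append(line)
    return "\n".join(out) + "\n"


# Shortened stand-in for the filled config from the question.
example = """[nlp]
lang = "en"
pipeline = ["transformer","ner"]

[components]

[components.ner]
factory = "ner"

[components.ner.model]
hidden_width = 64

[components.transformer]
factory = "transformer"

[corpora]
"""

updated = source_components(example, "/path/to/model-best", {"ner", "transformer"})
print(updated)
```

With the rewritten text saved as config.cfg, training would then be started the usual way, for example python -m spacy train ./config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy, where the output and corpus paths are placeholders for your own files.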
Upvotes: 4