miladjurablu

Reputation: 41

How to make spaCy NER training faster for Persian

I have a blank spaCy model. To generate the config file I used the Training Pipelines & Models quickstart widget with these settings:

Language = Arabic
Components = ner
Hardware = CPU
Optimize for = accuracy

Then in the config file I changed:

[nlp]
lang = "ar"

to

[nlp]
lang = "fa"

because there is no pretrained transformer (GPU) pipeline for Persian.

As you know, the accuracy preset is very slow, and I have 400,000 records.

This is my config file:

[paths]
train = null
dev = null
vectors = null

[system]
gpu_allocator = null

[nlp]
lang = "fa"
pipeline = ["tok2vec","ner"]
batch_size = 1000

[components]

[components.tok2vec]
factory = "tok2vec"

[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3

[components.ner]
factory = "ner"

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"

[training.optimizer]
@optimizers = "Adam.v1"

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

[initialize]
vectors = ${paths.vectors}

How can I make the training process faster?

Upvotes: 3

Views: 1102

Answers (2)

polm23

Reputation: 15623

To speed up training you have a few options.

Change the evaluation frequency. It's not in the config the widget generates, but there is an eval_frequency option, and it will be filled in if you use fill-config as recommended. The default value is relatively low, and evaluation is slow, so with a large amount of training data you should increase it a lot.
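For example, after running fill-config (python -m spacy init fill-config base_config.cfg config.cfg), the [training] block gains an eval_frequency key. The spaCy default is 200 steps; with 400,000 records, something much higher is reasonable (the value below is illustrative):

[training]
eval_frequency = 4000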

Use the efficiency preset instead of accuracy. If speed is an issue, this is worth trying. For your pipeline, the relevant options are whether to include static vectors, and the width and depth of your tok2vec. Note that this alone won't affect speed that much, but because it definitely reduces memory usage it combines well with the next option.
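As a sketch of what the efficiency preset changes relative to your config (regenerate with the widget to confirm the exact values), the tok2vec becomes narrower and shallower and static vectors are dropped:

[components.tok2vec.model.embed]
include_static_vectors = false

[components.tok2vec.model.encode]
width = 96
depth = 4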

Increase the batch size. In training, the time to process a batch is relatively constant, so larger batches mean fewer batches for the same data, which means faster training. How large a batch you can handle depends on the size of your documents and your hardware.
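In your config, the batch size is controlled by the compounding schedule under [training.batcher.size], which grows each batch from 100 to 1000 words. Raising those bounds gives larger batches; the numbers below are illustrative:

[training.batcher.size]
@schedules = "compounding.v1"
start = 500
stop = 4000
compound = 1.001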

Use less training data. This is very rarely something that I'd recommend, but if you have 400,000 records you shouldn't need that many to get a good NER model. (How many classes do you have?) Try 10,000 to start with and see how your model performs, and scale up until you get the accuracy/speed tradeoff you want. This will also help you figure out if there is some kind of issue with your data more quickly.
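A minimal sketch of subsampling an existing .spacy training file with DocBin (the paths and the 10,000 figure are illustrative):

import random

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("fa")

# Load the full training corpus (path is illustrative)
db = DocBin().from_disk("corpus/train.spacy")
docs = list(db.get_docs(nlp.vocab))

# Draw a reproducible 10k sample and save it as a new corpus
random.seed(0)
subset = random.sample(docs, 10_000)
DocBin(docs=subset).to_disk("corpus/train_10k.spacy")

Point paths.train at the smaller file, check the scores, then scale up.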

For tips on faster inference (not training), see the spaCy speed FAQ.

Upvotes: 2

Dan Taninecz Miller

Reputation: 121

You might be using just one core of your CPU, since single-core execution is more or less the Python default. I would look into parallelizing the job with joblib and increasing your chunk size.

See: https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html#Option-3:-Parallelize-the-work-using-joblib
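Note that this pattern parallelizes inference over your records with nlp.pipe rather than spacy train itself. A minimal sketch along the lines of that post, assuming a trained pipeline saved as "my_fa_ner_model" (the model name, chunk size, and worker count are all illustrative):

from joblib import Parallel, delayed
import spacy

def process_chunk(texts):
    # Each worker loads its own copy of the pipeline
    nlp = spacy.load("my_fa_ner_model")
    return [[(ent.text, ent.label_) for ent in doc.ents]
            for doc in nlp.pipe(texts, batch_size=100)]

def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

texts = ["..."]  # replace with your 400k records
results = Parallel(n_jobs=4)(
    delayed(process_chunk)(chunk) for chunk in chunks(texts, 10_000)
)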

Upvotes: 1
