Reputation: 166
Problem: I have millions of records that need to be transformed using a bunch of spacy textcat_multilabel models.
My current loop is as follows:

# pseudocode
for model in models:
    nlp = spacy.load(model)
    for group_of_records in records:  # millions of records
        new_data = nlp.pipe(group_of_records)  # data is processed in bulk
        # process data
        bulk_create_records(new_data)
As you can imagine, the more records I process and the more models I include, the longer this entire process takes. The idea is to make a single model and process my data only once, instead of (n * num_of_models) times, where n is the number of records.
Question: Is there a way to combine multiple textcat_multilabel models, created from the same spaCy config, into a single textcat_multilabel model?
Upvotes: 1
Views: 519
Reputation: 15623
There is no built-in feature to just combine models, but there are a couple of ways you can do this.
One is to source all your components into the same pipeline. This is very easy to do; see the double NER project for an example. The disadvantage is that this might not save you much processing time, since each separately trained model will still run its own tok2vec layer.
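A minimal sketch of sourcing, assuming two separately trained pipelines saved at the hypothetical paths ./model_a and ./model_b, trained with the same vocab and vectors, and each textcat using its own internal tok2vec rather than a shared listener:

import spacy

# Load the two separately trained pipelines (paths are hypothetical)
nlp = spacy.load("./model_a")
source_nlp = spacy.load("./model_b")

# Copy the trained component out of model_b into model_a's pipeline
# under a distinct name, so both run on every doc
nlp.add_pipe("textcat_multilabel", source=source_nlp, name="textcat_b")

doc = nlp("some record text")
print(doc.cats)  # scores from both components

Since both components write into doc.cats, the label scores merge cleanly as long as the two label sets don't overlap.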
You could also combine your training data and train one big model. But if your models really are separate problems, that would almost certainly cause a reduction in accuracy.
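If you go that route, here's a minimal sketch of merging the annotations, assuming both corpora are DocBin files (paths hypothetical) that annotate the same texts in the same order:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Hypothetical corpora: same texts, annotated for different label sets
db_a = DocBin().from_disk("./train_a.spacy")
db_b = DocBin().from_disk("./train_b.spacy")

merged = DocBin()
for doc_a, doc_b in zip(db_a.get_docs(nlp.vocab), db_b.get_docs(nlp.vocab)):
    doc_a.cats.update(doc_b.cats)  # union of both label sets
    merged.add(doc_a)

merged.to_disk("./train_merged.spacy")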
If speed is the primary concern, you could train each of your textcats separately while freezing your tok2vec. That would cost some accuracy, though maybe not too much, and it would allow you to combine the textcat models in the same pipeline while removing a bunch of redundant tok2vec processing. (Of the methods listed here, this is probably the best balance of implementation complexity, speed advantage, and accuracy sacrificed.)
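A sketch of the relevant config pieces for one such training run, assuming spaCy v3.1+ (for annotating_components), a hypothetical base pipeline at ./base_model whose tok2vec you reuse, and with most other settings omitted:

# partial config.cfg

[components.tok2vec]
# reuse the tok2vec from an existing pipeline (hypothetical path)
source = "./base_model"

[components.textcat_multilabel]
factory = "textcat_multilabel"
# configure this component's model with spacy.Tok2VecListener.v1
# so it reads from the shared tok2vec above

[training]
# don't update the shared layer, but still run it during training
# so the listener gets its predictions
frozen_components = ["tok2vec"]
annotating_components = ["tok2vec"]

Each textcat trained this way can then be sourced into one pipeline next to the single shared tok2vec, so the expensive embedding step runs once per doc instead of once per model.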
One thing that I don't think has been tested is training separate textcat components at the same time, with separate sets of labels, by manually specifying the labels for each component in your config. I am not completely sure that would work, but you could try it.
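Untested, as noted, but the config shape would look something like this (component names and labels are hypothetical):

# partial config.cfg - two textcat components in a single training run

[components.textcat_a]
factory = "textcat_multilabel"

[components.textcat_b]
factory = "textcat_multilabel"

[initialize.components.textcat_a]
labels = ["BILLING", "SHIPPING"]

[initialize.components.textcat_b]
labels = ["URGENT", "NOT_URGENT"]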
Upvotes: 1