Reputation: 11
I have a large database, 50 GB in size, consisting of excerpts from 486,000 dissertations across 780 specialties. For scientific purposes I need to train a classifier on this data, but my resources are limited to a mobile processor and 16 GB of RAM (+ 16 GB of swap).
A trial run on a subset of 40,000 items (10% of the database, about 4.5 GB) with SGDClassifier already consumed around 16-17 GB of memory.
Therefore, I am asking the community for help. My current code looks like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

text_clf = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(n_jobs=8)),
])

texts_train, texts_test, cat_train, cat_test = train_test_split(
    texts, categories_ids, test_size=0.2)
text_clf.fit(texts_train, cat_train)
I would appreciate advice on how to optimize this process so that I can train on the entire database.
Upvotes: 0
Views: 202
Reputation: 3779
You can utilize warm_start=True and call .partial_fit() (instead of .fit()). See the documentation for the model you are using, where that argument and that method are described.
Basically, you would load only a portion of the data at a time, run it through your pipeline and call partial_fit in a loop. This would keep the memory requirements down while also allowing you to train on all the data, regardless of the amount.
EDIT
As noted in the comments, the above-mentioned loop will only work for the predictive model, so the data pre-processing will need to happen separately.
Here is a solution for training the CountVectorizer iteratively.
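One way to do this (a minimal sketch, not necessarily the exact approach from the linked solution) is to collect the vocabulary chunk by chunk and then build a CountVectorizer with that fixed vocabulary. Here iter_text_chunks() is a hypothetical generator that yields lists of raw documents:
from sklearn.feature_extraction.text import CountVectorizer

# Stage 1: learn the vocabulary chunk by chunk so the raw text never has
# to be held in memory all at once.
vocabulary = set()
for text_chunk in iter_text_chunks():  # hypothetical generator of document lists
    chunk_vectorizer = CountVectorizer()
    chunk_vectorizer.fit(text_chunk)   # learns only the tokens present in this chunk
    vocabulary.update(chunk_vectorizer.vocabulary_.keys())

# A CountVectorizer built with a fixed vocabulary needs no further fitting.
count_vect = CountVectorizer(vocabulary=sorted(vocabulary))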
So the final solution would be to preprocess the data in two stages: the first pass for the CountVectorizer vocabulary and the second for the TF-IDF weighting.
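For the TF-IDF stage, one option (a sketch, assuming the stacked sparse count matrix fits in memory, which is far cheaper than holding the raw text) is to transform each chunk to counts with the fixed-vocabulary count_vect from above and fit the TfidfTransformer once on the stacked result:
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfTransformer

# Stage 2: learn the IDF weights over the full corpus from the sparse counts.
count_chunks = [count_vect.transform(text_chunk) for text_chunk in iter_text_chunks()]
counts = vstack(count_chunks)

tfidf = TfidfTransformer()
tfidf.fit(counts)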
Then to train the model you follow the same process as originally proposed, except without a Pipeline because that is no longer needed.
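A minimal sketch of that training loop, assuming count_vect and tfidf are the fitted objects from above and iter_labelled_chunks() is a hypothetical generator that yields (texts, labels) pairs:
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(n_jobs=8, warm_start=True)
all_classes = np.unique(categories_ids)  # partial_fit requires every class on the first call

for text_chunk, y_chunk in iter_labelled_chunks():  # hypothetical (texts, labels) generator
    X_chunk = tfidf.transform(count_vect.transform(text_chunk))
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)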
Upvotes: 1