Reputation: 2232
I have a large text corpus and use scikit-learn's TfidfVectorizer and gensim's Doc2Vec to compute language models. The corpus has about 100,000 documents, and I've noticed that my Jupyter notebook stops computing once I cross a certain threshold. I suspect memory fills up during the grid-search and cross-validation steps.
Even the following example script already stops at some point for Doc2Vec:
%%time
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.sklearn_api import D2VTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(1)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=1)

model_names = [
    'TfidfVectorizer',
    'Doc2Vec_PVDM',
]

models = [
    TfidfVectorizer(preprocessor=' '.join, tokenizer=None, min_df=5),
    D2VTransformer(dm=0, hs=0, min_count=5, iter=5, seed=1, workers=1),
]

parameters = [
    {
        'model__smooth_idf': (True, False),
        'model__norm': ('l1', 'l2', None)
    },
    {
        'model__size': [200],
        'model__window': [4]
    }
]

for params, model, name in zip(parameters, models, model_names):
    pipeline = Pipeline([
        ('model', model),
        ('clf', LogisticRegression())
    ])

    grid = GridSearchCV(pipeline, params, verbose=1, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_)

    cval = cross_val_score(grid.best_estimator_, X_train, y_train, scoring='accuracy', cv=5, n_jobs=-1)
    print("Cross-Validation (Train):", np.mean(cval))

print("Finished.")
Is there a way to "stream" each line of a document, instead of loading the full data into memory? Or another way to make this more memory-efficient? I've read a few articles on the topic, but couldn't find any that included a pipeline example.
Upvotes: 1
Views: 473
Reputation: 54233
With just 100,000 documents, unless they're gigantic, it's not necessarily the loading of data into memory that's causing you problems. Note especially: while gensim classes like Doc2Vec (and D2VTransformer) are happy with streamed data of arbitrary size, it's harder to adapt that streaming into scikit-learn. So you should look elsewhere, and there are other issues with your shown code.
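(For reference, outside of scikit-learn, gensim's own Doc2Vec can train from a corpus streamed off disk; a minimal sketch, assuming a hypothetical one-document-per-line file corpus.txt and gensim 3.x-style parameter names:)

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

class StreamedCorpus:
    """Lazily yields one TaggedDocument per line of a text file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(simple_preprocess(line), [i])

corpus = StreamedCorpus('corpus.txt')  # assumption: one document per line
model = Doc2Vec(vector_size=200, window=4, min_count=5, workers=3)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)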
I've often had memory or lockup issues with scikit-learn's attempts at parallelism (as enabled through n_jobs-like parameters), especially inside Jupyter notebooks. It forks full OS processes, which tend to blow up memory usage. (Each sub-process gets a full copy of the parent process's memory, which might be efficiently shared – until the subprocess starts moving/changing things.) Sometimes one process, or the inter-process communication, fails and the main process is just left waiting for a response – which seems to especially confuse Jupyter notebooks.
So, unless you have tons of memory and absolutely need scikit-learn parallelism, I'd recommend trying to get things working with n_jobs=1 first – and only later experimenting with more jobs.
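For example, a sketch of the same grid search run single-process (raise n_jobs again only once memory usage looks stable):

# single process; same parameters as in the question otherwise
grid = GridSearchCV(pipeline, params, verbose=1, cv=5, n_jobs=1)
grid.fit(X_train, y_train)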
In contrast, the workers parameter of the Doc2Vec class (and D2VTransformer) uses lighter-weight threads, and you should use at least workers=3, and perhaps 8 (if you have at least that many cores), rather than the workers=1 you're using now.
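A sketch of the transformer with more worker threads (assuming your machine has the cores to spare):

# same settings as in the question, but with threaded Doc2Vec training
D2VTransformer(dm=0, hs=0, min_count=5, iter=5, seed=1, workers=3)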
But also: you're doing a bunch of redundant actions of unclear value in your code. The test set from the initial train-test split is never used. (Perhaps you were thinking of keeping it aside as a final validation set? That's the most rigorous way to get a good estimate of your final result's performance on future unseen data, but in many contexts data is limited, and that estimate isn't as important as just doing the best possible with the limited data.)
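If you do keep it for that purpose, a sketch of a final check on that held-out split (using the pipeline GridSearchCV refits on the best parameters):

# final sanity check only: evaluates the refitted best pipeline on X_test/y_test
print("Held-out accuracy:", grid.best_estimator_.score(X_test, y_test))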
The GridSearchCV itself does a 5-way train/test split as part of its work, and its best results are remembered in its properties when it's done. So you don't need to do the cross_val_score() again – you can read the results from the GridSearchCV object.
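For instance, a sketch of reading those stored results (assumes the fitted grid from the code above; attribute names are standard scikit-learn):

print(grid.best_params_)                    # winning parameter combination
print(grid.best_score_)                     # mean cross-validated score for it
print(grid.cv_results_['mean_test_score'])  # mean CV score of every candidate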
Upvotes: 3