Reputation: 2232
I have a large text corpus and use scikit-learn's TfidfVectorizer and gensim's Doc2Vec to compute language models. The corpus has about 100,000 documents, and I've noticed that my Jupyter notebook stops computing once I cross a certain threshold. I suspect memory fills up during the grid-search and cross-validation steps.
Even the following example script already stops at some point for Doc2Vec:
%%time
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.sklearn_api import D2VTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(1)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=1)

model_names = [
    'TfidfVectorizer',
    'Doc2Vec_PVDM',
]

models = [
    TfidfVectorizer(preprocessor=' '.join, tokenizer=None, min_df=5),
    D2VTransformer(dm=0, hs=0, min_count=5, iter=5, seed=1, workers=1),
]

parameters = [
    {
        'model__smooth_idf': (True, False),
        'model__norm': ('l1', 'l2', None)
    },
    {
        'model__size': [200],
        'model__window': [4]
    }
]

for params, model, name in zip(parameters, models, model_names):
    pipeline = Pipeline([
        ('model', model),
        ('clf', LogisticRegression())
    ])

    grid = GridSearchCV(pipeline, params, verbose=1, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)
    print(grid.best_params_)

    cval = cross_val_score(grid.best_estimator_, X_train, y_train, scoring='accuracy', cv=5, n_jobs=-1)
    print("Cross-Validation (Train):", np.mean(cval))

print("Finished.")
Is there a way to "stream" each line of a document, instead of loading the full data into memory? Or another way to make this more memory-efficient? I've read a few articles on the topic, but couldn't find any that included a pipeline example.
Upvotes: 1
Views: 473
Reputation: 54233
With just 100,000 documents, unless they're gigantic, it's not necessarily the loading of data into memory that's causing you problems. Note especially: while gensim classes like Doc2Vec (and D2VTransformer) are happy with streamed data of arbitrary size, it's harder to adapt that streaming into scikit-learn. So you should look elsewhere, and there are other issues with your shown code.
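(For reference, outside of scikit-learn, gensim's own Doc2Vec can train from a corpus streamed off disk; a minimal sketch, assuming a hypothetical one-document-per-line file corpus.txt and gensim 3.x-style parameter names:)

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

class StreamedCorpus:
    """Lazily yields one TaggedDocument per line of a text file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for i, line in enumerate(f):
                yield TaggedDocument(simple_preprocess(line), [i])

corpus = StreamedCorpus('corpus.txt')  # assumption: one document per line
model = Doc2Vec(vector_size=200, window=4, min_count=5, workers=3)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)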
I've often had memory or lockup issues with scikit-learn's attempts at parallelism (as enabled through n_jobs-like parameters), especially inside Jupyter notebooks. It forks full OS processes, which tend to blow up memory usage. (Each sub-process gets a full copy of the parent process's memory, which might be efficiently shared – until the subprocess starts moving/changing things.) Sometimes one process, or the inter-process communication, fails and the main process is just left waiting for a response – which seems to especially confuse Jupyter notebooks.
So, unless you have tons of memory and absolutely need scikit-learn parallelism, I'd recommend trying to get things working with n_jobs=1 first – and only later experimenting with more jobs.
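For example, a sketch of the same grid search run single-process (raise n_jobs again only once memory usage looks stable):

# single process; same parameters as in the question otherwise
grid = GridSearchCV(pipeline, params, verbose=1, cv=5, n_jobs=1)
grid.fit(X_train, y_train)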
In contrast, the workers parameter of the Doc2Vec class (and D2VTransformer) uses lighter-weight threads, and you should use at least workers=3, and perhaps 8 (if you have at least that many cores), rather than the workers=1 you're using now.
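A sketch of the transformer with more worker threads (assuming your machine has the cores to spare):

# same settings as in the question, but with threaded Doc2Vec training
D2VTransformer(dm=0, hs=0, min_count=5, iter=5, seed=1, workers=3)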
But also: you're doing a bunch of redundant actions of unclear value in your code. The test set from the initial train-test split is never used. (Perhaps you were thinking of keeping it aside as a final validation set? That's the most rigorous way to get a good estimate of your final result's performance on future unseen data, but in many contexts data is limited, and that estimate isn't as important as just doing the best possible with the limited data.)
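If you do keep it for that purpose, a sketch of a final check on that held-out split (using the pipeline GridSearchCV refits on the best parameters):

# final sanity check only: evaluates the refitted best pipeline on X_test/y_test
print("Held-out accuracy:", grid.best_estimator_.score(X_test, y_test))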
The GridSearchCV itself does a 5-way train/test split as part of its work, and its best results are remembered in its properties when it's done. So you don't need to do the cross_val_score() again – you can read the results from the GridSearchCV object.
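For instance, a sketch of reading those stored results (assumes the fitted grid from the code above; attribute names are standard scikit-learn):

print(grid.best_params_)                    # winning parameter combination
print(grid.best_score_)                     # mean cross-validated score for it
print(grid.cv_results_['mean_test_score'])  # mean CV score of every candidate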
Upvotes: 3