Reputation: 837
For a machine learning task I need to deal with quite large data sets, so I cannot feed the entire data set to my algorithm at once. I am looking for a way to train my algorithm on the data set in parts; simply feeding it new chunks won't work, since the algorithm will just refit on each chunk and won't take the previously seen examples into account. Is there a method with which I can feed my algorithm new data while "remembering" the data seen before?
Edit: The algorithm I use is the SGDRegressor from scikit-learn.
The code:
import os
import pandas as pd
from sklearn.linear_model import SGDRegressor

train = pd.read_csv(os.path.join(dir, "Train.csv"), chunksize=5000)
labels = pd.read_csv(os.path.join(dir, "Labels.csv"), chunksize=5000)
algo = SGDRegressor(n_iter=75)
print("looping for chunks in train")
for chunk, label_chunk in zip(train, labels):
    algo.fit(chunk, label_chunk)  # refits from scratch on every chunk
Upvotes: 4
Views: 2703
Reputation: 48317
You can use partial_fit to feed parts of the training data to SGDRegressor one chunk at a time. Unlike fit, which retrains from scratch on whatever you pass it, partial_fit updates the existing model incrementally, so earlier chunks are not forgotten. See the sample code in the scikit-learn examples.
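For instance, a minimal sketch adapted to your setup (file names, chunk size, and the dir variable are taken from the question; that Labels.csv holds a single target column is an assumption):

import os
import pandas as pd
from sklearn.linear_model import SGDRegressor

algo = SGDRegressor()

# Read features and labels in matching 5000-row chunks
# (dir is the data directory from the question).
train = pd.read_csv(os.path.join(dir, "Train.csv"), chunksize=5000)
labels = pd.read_csv(os.path.join(dir, "Labels.csv"), chunksize=5000)

for X_chunk, y_chunk in zip(train, labels):
    # partial_fit makes one pass over the chunk and updates the
    # existing weights in place, so previous chunks still count.
    algo.partial_fit(X_chunk, y_chunk.values.ravel())

Each call to partial_fit performs a single epoch of SGD over that chunk, so you may want to loop over the files more than once if the model has not converged.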
Upvotes: 6