Learner

Reputation: 837

Training SGDRegressor on a dataset in chunks

For a machine learning task I need to deal with quite large datasets, so I cannot fit the entire dataset into my algorithm at once. I am looking for a way to train my algorithm on the dataset in parts; simply feeding it new chunks won't work, since the algorithm will just refit and won't take the previous examples into account. Is there a method with which I can feed my algorithm new data while it "remembers" the data seen before?

Edit: The algorithm I use is the SGDRegressor from scikit-learn.

The code:

import os

import pandas as pd
from sklearn.linear_model import SGDRegressor

# chunksize makes read_csv return an iterator over DataFrame chunks
train = pd.read_csv(os.path.join(dir, "Train.csv"), chunksize=5000)
labels = pd.read_csv(os.path.join(dir, "Labels.csv"), chunksize=5000)
algo = SGDRegressor(n_iter=75)
print("looping for chunks in train")
for chunk, label_chunk in zip(train, labels):
    # fit() refits from scratch on every chunk, discarding what was
    # learned from the previous chunks -- this is the problem
    algo.fit(chunk, label_chunk)

Upvotes: 4

Views: 2703

Answers (1)

alko

Reputation: 48317

You can use partial_fit to feed parts of the training data to SGDRegressor. Unlike fit, partial_fit updates the existing model with each new batch instead of refitting from scratch, so previous chunks are not forgotten.

See the sample code in the scikit-learn examples.
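A minimal, self-contained sketch of the idea, using synthetic data in place of the asker's CSV chunks (the chunk sizes and feature counts here are illustrative, not from the question):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
algo = SGDRegressor()

for _ in range(20):  # stand-in for "for chunk in train"
    X_chunk = rng.rand(5000, 10)
    y_chunk = X_chunk.dot(np.arange(10)) + 0.1 * rng.randn(5000)
    # partial_fit updates the existing weights on each batch
    # instead of refitting, so earlier chunks are not forgotten
    algo.partial_fit(X_chunk, y_chunk)

print(algo.coef_)  # one learned coefficient per feature
```

With the asker's setup, the loop body would simply become `algo.partial_fit(chunk, label_chunk)` for each pair of chunks read from the two CSVs.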

Upvotes: 6
