Reputation: 837
For a machine learning task I need to deal with quite large data sets, so I cannot feed the entire data set to my algorithm at once. I am looking for a way to train my algorithm on the data set in parts; simply feeding it new chunks won't work, since the algorithm will just refit on each chunk and won't take the previously seen examples into account. Is there a method with which I can feed my algorithm new data while "remembering" the data seen before?
Edit: The algorithm I use is the SGDRegressor from scikit-learn.
The code:
import os
import pandas as pd
from sklearn.linear_model import SGDRegressor

train = pd.read_csv(os.path.join(dir, "Train.csv"), chunksize=5000)
labels = pd.read_csv(os.path.join(dir, "Labels.csv"), chunksize=5000)
algo = SGDRegressor(n_iter=75)
print("looping for chunks in train")
for chunk, label_chunk in zip(train, labels):
    algo.fit(chunk, label_chunk)  # refits from scratch on every chunk
Upvotes: 4
Views: 2703
Reputation: 48317
You can use partial_fit to feed parts of the training data to SGDRegressor one chunk at a time. Unlike fit, which retrains from scratch on whatever you pass it, partial_fit updates the existing model incrementally, so earlier chunks are not forgotten. See the sample code in the scikit-learn examples.
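For instance, a minimal sketch adapted to your setup (file names, chunk size, and the dir variable are taken from the question; that Labels.csv holds a single target column is an assumption):

import os
import pandas as pd
from sklearn.linear_model import SGDRegressor

algo = SGDRegressor()

# Read features and labels in matching 5000-row chunks
# (dir is the data directory from the question).
train = pd.read_csv(os.path.join(dir, "Train.csv"), chunksize=5000)
labels = pd.read_csv(os.path.join(dir, "Labels.csv"), chunksize=5000)

for X_chunk, y_chunk in zip(train, labels):
    # partial_fit makes one pass over the chunk and updates the
    # existing weights in place, so previous chunks still count.
    algo.partial_fit(X_chunk, y_chunk.values.ravel())

Each call to partial_fit performs a single epoch of SGD over that chunk, so you may want to loop over the files more than once if the model has not converged.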
Upvotes: 6