Reputation: 1592
Based on Incremental PCA on big data and the incremental PCA docs, it's suggested to use a memmap array, but would it be possible to accomplish the same thing using dask?
Update: I've expanded the question to include other partial_fit algorithms, as the git repo for dask mentions a method of using any scikit-learn estimator that supports partial_fit, but I can't seem to find documentation for it in the API. When I attempted incremental PCA on a 6000x250000 float64 dask dataframe, it took 8 hours to make 9% progress on a 16-core, 104 GB VM without adjusting the dask scheduler, but I wasn't sure whether that was down to my poor code or whether that's what to expect with a dataset of this size. I'd welcome any advice on batch sizing for SGD, even just as a proof of concept.

https://github.com/dask/dask/blob/master/dask/array/learn.py

http://matthewrocklin.com/blog/work/2016/07/12/dask-learn-part-1
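For reference, here's a minimal sketch of the pattern I was attempting: feeding row-chunks of a dask array to IncrementalPCA.partial_fit one at a time. The chunk size and n_components are placeholders, not the exact values I used:

```python
import dask.array as da
from sklearn.decomposition import IncrementalPCA

# Placeholder data matching the shape described above; in practice it
# would come from the dask dataframe (e.g. via df.to_dask_array()).
x = da.random.random((6000, 250000), chunks=(500, 250000))

ipca = IncrementalPCA(n_components=10)
# Feed one row-chunk at a time so only ~1 GB is materialized per step.
for i in range(x.numblocks[0]):
    chunk = x.blocks[i].compute()
    ipca.partial_fit(chunk)
```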
Upvotes: 2
Views: 586
Reputation: 57271
The dask.array.linalg.svd function operates in parallel and in small memory.
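For example, a PCA can be recovered from the SVD of the column-centered data. A minimal sketch, assuming a tall-and-skinny array chunked along the rows only (the layout da.linalg.svd expects); the shape and chunking here are illustrative:

```python
import dask.array as da

# Illustrative tall-and-skinny array, chunked along rows only.
x = da.random.random((1000000, 250), chunks=(10000, 250))

x_centered = x - x.mean(axis=0)      # PCA requires column-centered data
u, s, v = da.linalg.svd(x_centered)  # lazy, parallel, small-memory SVD
components = v.compute()             # rows of v are the principal axes
```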
The fit and predict functions in dask.array support any sklearn.Estimator with a partial_fit method.
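A minimal sketch of that pattern with SGDClassifier, assuming that fit (from dask/array/learn.py, linked in the question) forwards extra keyword arguments to each partial_fit call; that module may have moved or been removed in later dask versions:

```python
import numpy as np
import dask.array as da
from sklearn.linear_model import SGDClassifier
from dask.array.learn import fit, predict

x = da.random.random((10000, 50), chunks=(1000, 50))
y = da.random.randint(0, 2, size=10000, chunks=1000)

sgd = SGDClassifier()
# Assumption: fit calls sgd.partial_fit on each block in turn,
# forwarding classes= through to partial_fit (SGDClassifier needs
# the full set of class labels on the first partial_fit call).
sgd = fit(sgd, x, y, classes=np.array([0, 1]))

predictions = predict(sgd, x)  # lazy dask array of per-row predictions
```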
The dask-learn project handles partial_fit, grid searches, pipelines, etc. See the three-part blog series by Jim Crist about the project, starting with the part-1 post linked in the question above.
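A minimal grid-search sketch, assuming dklearn (the dask-learn package) mirrors scikit-learn's module layout as the blog series describes; the import path is an assumption and may differ across versions:

```python
# Assumption: dklearn provides a drop-in GridSearchCV that builds a
# dask graph and runs the candidate fits in parallel.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from dklearn.grid_search import GridSearchCV

X, y = make_classification(n_samples=10000, n_features=50)

grid = GridSearchCV(SGDClassifier(),
                    param_grid={'alpha': [1e-4, 1e-3, 1e-2]})
grid.fit(X, y)
```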
Upvotes: 4