Reputation: 128
How to run dask_cuML (logistic regression, for example) on a large dataset (dask_cudf)?
I cannot run cuML on my cudf DataFrame because the dataset is large, so I get an "out of memory" error as soon as I try anything. The bright side is that I have 4 GPUs to use with dask_cudf.
Does anybody know the steps to run, for example, logistic regression on a dask_cudf DataFrame?
About my current cudf and cuml logistic regression code:
import cudf
import cuml

type(gdf)   # -> cudf.core.dataframe.DataFrame
logreg = cuml.LogisticRegression(penalty='none', tol=1e-6, max_iter=10000)
logreg.fit(gdf[['A', 'B', 'C', 'D', 'E']], gdf['Z'])
My thoughts, in steps (not working!):
1- Convert the cudf DataFrame to dask_cudf:
ddf = dask_cudf.from_cudf(gdf, npartitions=2)  # how should I choose the number of partitions? (my guess is in the sketch after step 4)
2- meta_dtypes = dict(zip(ddf.columns, ddf.dtypes))
3- Define a per-partition fit function:
def logistic_regression(gdf):
    return logreg.fit(gdf[['A', 'B', 'C', 'D', 'E']], gdf['Z'])
4- ddf = ddf.map_partitions(logistic_regression, meta=meta_dtypes)
ddf.compute().persist()
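For context, here is how I planned to set up the workers before the steps above -- a minimal sketch, assuming dask_cuda's LocalCUDACluster and guessing one partition per GPU (4 in my case):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf

# One dask worker per visible GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Guess: one partition per GPU so each worker gets a chunk
ddf = dask_cudf.from_cudf(gdf, npartitions=4)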
Any suggestions or insights are appreciated!
Upvotes: 1
Views: 493
Reputation: 86
Thank you for trying out cuML! The official release of cuML doesn't have multi-GPU logistic regression yet (coming soon!). I'm implementing a workaround using dask-glm and cupy. I'll publish my notebook in this thread once it is ready. Here are the general steps:
ddf = dask_cudf.read_csv("*.csv")
X = ddf[['A', 'B', 'C', 'D', 'E']].values
y = ddf['Z'].values
where each chunk of the dask array is a cupy array.
from dask_glm.estimators import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
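Once it's fitted, a rough sketch of how I expect the result to be used (assuming dask_glm's predict and the same column names as in your question):

preds = clf.predict(X)   # lazy dask array of predictions
preds.compute()          # materialize the predictions on the client
clf.coef_                # fitted coefficients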
Upvotes: 3