Reputation: 77
I have gone through all the similar questions and the solutions provided, but I am still not getting the desired output.
I have a list of dask delayed objects.
for y in ys:
    projection = Projection(data, X, y)
    fi = projection.decode()
    var.append(fi)
where the Projection class and its decode method are as follows:
class Projection(object):
    def __init__(self, data, X, y=0):
        # data is a dataframe, X holds the indices of the independent variables
        # and y is the index of the dependent variable
        self.data = data
        self.X = X
        self.y = y
    ...
    ...
    @dask.delayed
    def decode(self) -> list:
        regressor = RandomForestRegressor(n_estimators=50, max_features='sqrt', n_jobs=-1, max_depth=6, verbose=0)
        regressor.fit(self.X, self.y)
        fi = regressor.feature_importances_
        return fi
var is:
[Delayed('decode-82afe417-9d1e-48ff-95a3-02ddc90c6970'),
Delayed('decode-0a872626-996a-4a19-8b45-b39acb44257f'),
Delayed('decode-cfa53fd4-cf5b-47f1-a672-440dc5f5ca35'),
Delayed('decode-29cf7f51-2e7a-4c9d-8ac0-bc2259d50b6f'),
Delayed('decode-2edc8324-f9df-4402-a1ed-44a6a9067f1d'),
Delayed('decode-05de7417-49a5-40b7-8098-f2aad50bd934'),
Delayed('decode-80916f08-2d28-4811-9ab4-e526af978aac'),
Delayed('decode-da4a8874-77b5-4d75-aede-c96b5e73e888'),
Delayed('decode-1c1fe7f0-a32b-4a0a-9d13-bb45710a3738')]
Now I want to compute var and get the results as a list, an array, or a data frame. I tried the following options:
Option 1:
    dask.compute(*var)
Option 2:
    v = dask.array.from_array(np.array(var), chunks=(100,))
    dask.array.compute(*v)
Option 3:
    v = dask.array.from_delayed(np.array(var))
    dask.array.compute(*v)
Option 4:
    v = dask.array.from_delayed(np.array(var))
    v.compute()
In every case, I either get the list of delayed objects back or the computation times out.
Option 1 gives the following error:
numpy.core._exceptions.MemoryError: Unable to allocate 458. MiB for an array with shape (19971, 3005) and data type int64
Thanks in advance.
Upvotes: 2
Views: 2442
Reputation: 16551
Option 1 appears to be the most appropriate one. Options 3 and 4 will result in a list of delayed objects because in those options v contains nested delayed objects.
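To illustrate the Option 1 pattern, here is a minimal sketch. It assumes each delayed task returns a fixed-length NumPy vector; `fake_importances` is a hypothetical stand-in for `Projection.decode`, not the code from the question. It also shows one possible way to get a dask array instead, by wrapping each delayed object separately with an explicit shape and dtype rather than wrapping the whole list at once.

```python
import dask
import dask.array as da
import numpy as np


@dask.delayed
def fake_importances(seed, n_features=4):
    # Hypothetical stand-in for Projection.decode: one importance vector per target.
    rng = np.random.default_rng(seed)
    weights = rng.random(n_features)
    return weights / weights.sum()


# A flat list of Delayed objects, like `var` in the question.
var = [fake_importances(seed) for seed in range(3)]

# Option 1: unpack the list; dask.compute returns a tuple of concrete results.
results = dask.compute(*var)
importances = np.vstack(results)  # one row per delayed task

# Alternative: wrap each Delayed individually (shape and dtype are assumed known),
# then stack the pieces into a single dask array.
pieces = [da.from_delayed(d, shape=(4,), dtype=float) for d in var]
stacked = da.stack(pieces).compute()
```

Both routes should give the same 2-D array here. If the per-task results have different lengths, only the plain `dask.compute(*var)` route applies as written.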
It would help to know more details about the setup (local/distributed), data magnitude, computation intensity, and the activity on the dask dashboard.
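As a rough sketch of why the setup matters: with a local scheduler, the number of tasks running at once can be capped, which may reduce peak memory if each task holds a large array. Whether this resolves the MemoryError in the question is an assumption, not something confirmed here. `column_means` below is a hypothetical task used only for illustration.

```python
import dask
import numpy as np


@dask.delayed
def column_means(seed):
    # Hypothetical task standing in for a memory-hungry model fit.
    rng = np.random.default_rng(seed)
    return rng.random((1000, 5)).mean(axis=0)


tasks = [column_means(seed) for seed in range(4)]

# Local threaded scheduler, limited to two worker threads.
capped = dask.compute(*tasks, scheduler="threads", num_workers=2)

# Single-threaded scheduler: runs one task at a time, which is also easier to debug.
serial = dask.compute(*tasks, scheduler="synchronous")
```

Note that `n_jobs=-1` inside each `RandomForestRegressor` adds its own parallelism on top of dask's, so the two settings may need to be tuned together.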
Upvotes: 2