Reputation: 40618
Once I have a dask dataframe, how can I selectively pull columns into an in-memory pandas DataFrame? Say I have an N x M dataframe. How can I create an N x m dataframe, where m << M and the m columns are arbitrary?
from sklearn.datasets import load_iris
import pandas as pd
import dask.dataframe as dd

d = load_iris()
df = pd.DataFrame(d.data)
ddf = dd.from_pandas(df, chunksize=100)

# What I'd like to do -- but iloc is not implemented for dask dataframes:
in_memory = ddf.iloc[:, 2:4].compute()

# This works instead:
ddf.map_partitions(lambda x: x.iloc[:, 2:4]).compute()
map_partitions works, but it was quite slow on a file that wasn't very large. I hope I am missing something obvious.
Upvotes: 6
Views: 8749
Reputation: 28673
Although iloc is not implemented for dask-dataframes, you can achieve the indexing easily enough as follows:
cols = list(ddf.columns[2:4])
ddf[cols].compute()
This has the additional benefit that dask immediately knows the types of the selected columns and needs to do no extra work. With the map_partitions variant, dask at least has to check the data types produced, since the function you call is completely arbitrary.
Upvotes: 8