Reputation: 923
I'm a complete newbie to Python Dask (a little experience with pandas). I have a large Dask DataFrame (~10 to 20 million rows) that I have to split into subsets based on the unique values of a column.
For example, if I have the following DataFrame with columns C1 to Cn (sorry, I don't know how to make a proper table on Stack Overflow) and I want to create a subset DataFrame for each unique value of the column C2:
Base Dataframe:
|Ind| C1 | C2 |....| Cn |
|-----------------------|
| 1 |val1| AE |....|time|
|-----------------------|
| 2 |val2| FB |....|time|
|-----------------------|
|...|....| .. |....| ...|
|-----------------------|
| n |valn| QK |....|time|
Subset Dataframes to be created:
Subset 1:
|Ind| C1 | C2 |....| Cn |
|-----------------------|
| 1 |val1| AE |....|time|
|-----------------------|
| 2 |val2| AE |....|time|
|-----------------------|
|...|....| .. |....| ...|
|-----------------------|
| n |valn| AE |....|time|
Subset 2
|Ind| C1 | C2 |....| Cn |
|-----------------------|
| 1 |val1| FB |....|time|
|-----------------------|
| 2 |val2| FB |....|time|
|-----------------------|
|...|....| .. |....| ...|
|-----------------------|
| n |valn| FB |....|time|
and so on.
My current approach is to get all unique values of C2 and filter the base DataFrame for each of these values iteratively, roughly like the sketch below. But this takes way too long. I'm doing research at the moment on how I can improve this process, but I would appreciate it a lot if any of you could give me some tips.
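A simplified sketch of what I'm doing now (df is my Dask DataFrame, column names shortened):
import dask.dataframe as dd

# get the distinct C2 values, then filter the frame once per value
unique_vals = df["C2"].unique().compute()
subsets = {}
for val in unique_vals:
    subsets[val] = df[df["C2"] == val]  # lazy filter, computed later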
Upvotes: 4
Views: 3030
Reputation: 13437
It seems to me that you can obtain the same subsets with groupby, both in pandas and in Dask.
import pandas as pd
import dask.dataframe as dd
import numpy as np
import string

# build a small sample frame where the same C2 values appear several times
N = 5
rndm2 = lambda: "".join(np.random.choice(list(string.ascii_lowercase), 2))
df_sample = pd.DataFrame({"C1": np.arange(N),
                          "C2": [rndm2() for i in range(N)],
                          "C3": np.random.randn(N)})
M = 2
df = pd.concat([df_sample for i in range(M)], ignore_index=True)
df["C4"] = np.random.randn(N * M)
Here I'm just running print(list(df.groupby("C2"))[0][1])
to show you what's inside each group:
C1 C2 C3 C4
3 3 bx 0.668654 -0.237081
8 3 bx 0.668654 0.619883
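As far as I know you can't iterate over a Dask groupby the way you can over a pandas one, but you can apply a function to each group. A minimal sketch (the mean of C4 is just a stand-in computation, and the meta describes the resulting float64 Series):
ddf = dd.from_pandas(df, npartitions=4)

# compute one value per C2 group; meta tells Dask the output type
per_group = ddf.groupby("C2").apply(lambda g: g["C4"].mean(),
                                    meta=("C4", "float64"))
print(per_group.compute())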
If you need the data nicely partitioned on disk, you can do the following:
ddf = dd.from_pandas(df, npartitions=4)
ddf.to_parquet("saved/", partition_on=["C2"])
# You can check that the parquet files
# are in separate folders, e.g.
! ls saved/  # if you are on Linux
'C2=iw' 'C2=jl' 'C2=qf' 'C2=wy' 'C2=yr' _common_metadata
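As a side note, once the data is written with partition_on you can also read back a single group with the filters argument of read_parquet, which should only touch the matching folder ("iw" is just one of the values from the listing above):
# load only the rows whose C2 value is "iw"
one_group = dd.read_parquet("saved/", filters=[("C2", "==", "iw")])
print(one_group.compute())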
Now if you want to perform some computation on these groups you can apply your function fun with map_partitions, taking care of the output meta.
ddf = dd.read_parquet("saved/")
out = ddf.map_partitions(fun).compute()  # you should add your output meta
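For example, if fun just computes a small per-partition summary, the meta could be spelled out explicitly like this (fun and its output columns are only illustrative, not something from your code):
def fun(pdf):
    # pdf is the pandas DataFrame holding one partition
    return pd.DataFrame({"n_rows": [len(pdf)],
                         "C4_mean": [pdf["C4"].mean()]})

ddf = dd.read_parquet("saved/")
meta = pd.DataFrame({"n_rows": pd.Series(dtype="int64"),
                     "C4_mean": pd.Series(dtype="float64")})
out = ddf.map_partitions(fun, meta=meta).compute()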
Upvotes: 3