pichlbaer

Reputation: 923

Python Dask dataframe separation based on column value

I'm a complete newbie to Python Dask (I have a little experience with pandas). I have a large Dask DataFrame (~10 to 20 million rows) that I have to split into subsets based on the unique values of one column.

For example, if I have the following DataFrame with columns C1 to Cn (sorry, I don't know how to make a proper table on Stack Overflow), I want to create a subset DataFrame for each unique value of column C2.

Base DataFrame:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| 1   | val1 | AE | ... | time |
| 2   | val2 | FB | ... | time |
| ... | ...  | .. | ... | ...  |
| n   | valn | QK | ... | time |
Subset DataFrames to be created:

Subset 1:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| 1   | val1 | AE | ... | time |
| 2   | val2 | AE | ... | time |
| ... | ...  | .. | ... | ...  |
| n   | valn | AE | ... | time |

Subset 2:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| 1   | val1 | FB | ... | time |
| 2   | val2 | FB | ... | time |
| ... | ...  | .. | ... | ...  |
| n   | valn | FB | ... | time |


and so on.

My current approach is to get all unique values of C2 and then filter the base DataFrame for each of these values in turn, but this takes far too long. I'm currently researching how to improve this process, but I would appreciate any tips you can give me.
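
For reference, a rough sketch of that iterative approach (assuming the Dask DataFrame is called ddf and that the number of unique C2 values is small enough to collect into memory):

import dask.dataframe as dd

# Current slow approach: collect the unique keys, then build one filtered
# DataFrame per key. Every subsets[val].compute() re-scans the data.
unique_vals = ddf["C2"].unique().compute()
subsets = {val: ddf[ddf["C2"] == val] for val in unique_vals}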

Upvotes: 4

Views: 3030

Answers (1)

rpanai

Reputation: 13437

It seems to me that you can obtain the same subsets with groupby, in both pandas and Dask.

import pandas as pd
import dask.dataframe as dd
import numpy as np
import string

# Small sample frame with a random two-letter key column C2
N = 5
rndm2 = lambda: "".join(np.random.choice(list(string.ascii_lowercase), 2))
df_sample = pd.DataFrame({"C1": np.arange(N),
                          "C2": [rndm2() for i in range(N)],
                          "C3": np.random.randn(N)})

# Duplicate the sample so every C2 value appears in more than one row
M = 2
df = pd.concat([df_sample for i in range(M)], ignore_index=True)
df["C4"] = np.random.randn(N * M)

Here I'm just running print(list(df.groupby("C2"))[0][1]) to show you what is inside every group:

   C1  C2        C3        C4
3   3  bx  0.668654 -0.237081
8   3  bx  0.668654  0.619883
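
The same grouping works lazily in Dask. As a minimal sketch (the per-group function here is only an illustrative placeholder, and the meta is built from an empty slice of df):

ddf = dd.from_pandas(df, npartitions=4)

# Illustrative per-group function: keep the first row of each group.
# meta tells Dask what the output of apply will look like.
first_rows = ddf.groupby("C2").apply(lambda g: g.head(1), meta=df.head(0))
print(first_rows.compute())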

If you need the data nicely partitioned on disk, you can do the following:

ddf = dd.from_pandas(df, npartitions=4)
ddf.to_parquet("saved/", partition_on=["C2"])

# You can check that the parquet files
# are in separated folder as
! ls saved/ # If you are on Linux

'C2=iw'  'C2=jl'  'C2=qf'  'C2=wy'  'C2=yr'   _common_metadata
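
To load a single group back, one option is the filters argument of read_parquet (the value "iw" below is just one of the random keys from this sample, so yours will differ):

# Read back only one C2 group from the partitioned parquet dataset
one_group = dd.read_parquet("saved/", filters=[("C2", "==", "iw")])
print(one_group.compute())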

Now, if you want to perform some computation using these groups, you can apply your function fun with map_partitions, taking care to specify the output meta.

df = dd.read_parquet("saved/")
out = df.map_partitions(fun).compute() # you should add your output meta
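
For instance, with a hypothetical fun that adds a group-centred column, one way to build the meta is to apply fun to an empty head of the frame:

def fun(pdf):
    # Hypothetical per-partition computation on each pandas piece
    return pdf.assign(C3_centered=pdf["C3"] - pdf["C3"].mean())

meta = fun(df.head(0))  # empty frame describing the output columns and dtypes
out = df.map_partitions(fun, meta=meta).compute()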

Upvotes: 3
