Konstantin

Reputation: 3159

Split a pandas DataFrame into chunks of approximately equal size

I want to split this DataFrame into a predefined number of chunks of approximately the same size:

import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "B", "C", "C", "C"],
    "value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})


#     user  value
# 0      A    0.3
# 1      A    0.4
# 2      B    0.5
# 3      C    0.6
# 4      C    0.7
# 5      C    0.8

The DataFrame is large (millions of rows), so the code should be reasonably efficient. The constraint is that any given user must be present in only one of the chunks.

E.g. if the number of chunks is 3 then:

# Chunk #1 (DataFrame):
# 0      A    0.3
# 1      A    0.4

# Chunk #2 (DataFrame):
# 2      B    0.5

# Chunk #3 (DataFrame):
# 3      C    0.6
# 4      C    0.7
# 5      C    0.8

This chunking into 3 pieces would be incorrect, since user C would be present in 2 chunks:

# Chunk #1 (DataFrame):
# 0      A    0.3
# 1      A    0.4

# Chunk #2 (DataFrame):
# 2      B    0.5
# 3      C    0.6

# Chunk #3 (DataFrame):
# 4      C    0.7
# 5      C    0.8

I think a solution that first does a groupby by user and then splits the resulting DataFrameGroupBy object into chunks should suffice.
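Something like the sketch below is roughly what I have in mind: greedily assign each user's rows to the currently smallest chunk (chunk_by_user is just a made-up name for illustration):

import numpy as np
import pandas as pd

def chunk_by_user(df, n_chunks):
    # rows per user, largest users first
    sizes = df["user"].value_counts()
    loads = np.zeros(n_chunks)   # rows assigned to each chunk so far
    assignment = {}
    for user, size in sizes.items():
        i = loads.argmin()       # currently smallest chunk
        assignment[user] = i
        loads[i] += size
    labels = df["user"].map(assignment)
    return [chunk for _, chunk in df.groupby(labels)]

chunks = chunk_by_user(df, 3)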

Upvotes: 3

Views: 1400

Answers (3)

filippo

Reputation: 5294

You could turn your user column into a categorical one and use qcut for uniform-height binning. Unfortunately, qcut fails to find unique bin edges for discontinuous distributions, so you may run into issues if one user is overrepresented. You could pass duplicates="drop", but then you won't always get the number of bins you requested, as some will be clumped together.
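For instance, here's a quick sketch of that failure mode with made-up skewed counts (half the rows belong to a single user):

import numpy as np
import pandas as pd

skewed = pd.Series(["A"] * 5000 + ["B"] * 2500 + ["C"] * 2500)
codes = skewed.astype("category").cat.codes

# pd.qcut(codes, 3, labels=False) raises
# ValueError: Bin edges must be unique

# duplicates="drop" avoids the error but merges bins:
# A and B land together, leaving only 2 of the 3 requested bins
bins = pd.qcut(codes, 3, labels=False, duplicates="drop")
print(bins.nunique())  # 2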

I guess you will have to write some algorithm for proper rebinning; I couldn't find anything ready out of the box.

Here's an example with pd.qcut.

Let's build a dummy dataset

import numpy as np
import pandas as pd

# dummy dataset: 10,000 rows spread over 8 users
user = np.random.choice(["A", "B", "C", "D", "E", "F", "G", "H"], 10000)
value = np.random.random(size=user.shape)
df = pd.DataFrame({"user": user, "value": value})
print(df.user.value_counts())

E    1329
C    1281
G    1277
F    1260
H    1231
D    1223
A    1205
B    1194
Name: user, dtype: int64

assign an integer code to each unique user and bin the codes with qcut

codes = df.user.astype("category").cat.codes
nbins = 3
# qcut on the per-user codes keeps all rows of a user in the same bin
df["bin"] = pd.qcut(codes, nbins, labels=False)

let's check the results:

print(df.bin.value_counts())
1    3788
0    3629
2    2583
Name: bin, dtype: int64

print(df.groupby("user").bin.value_counts())
user  bin
A     0      1266
B     0      1158
C     0      1205
D     1      1255
E     1      1246
F     1      1287
G     2      1274
H     2      1309
Name: bin, dtype: int64

Upvotes: 2

anky

Reputation: 75080

You can also try np.split, with a condition that checks whether the requested number of chunks equals the number of unique users: if it does, split at every user boundary; otherwise, put the first n users into one chunk and the remaining users into another:

import numpy as np

def split_fun(data, n):
    # does the requested chunk count match the number of unique users?
    cond = len(set(data['user'])) == n
    # integer codes for the users, starting at 1
    f = data['user'].factorize()[0] + 1
    if cond:
        # split at every position where the user changes
        p = np.where(np.diff(f) > 0)[0] + 1
    else:
        # single split where the user code first exceeds n
        p = np.where(np.diff((f > n).view('i1')) > 0)[0] + 1
    return np.split(data, p)

Sample Runs:

split_fun(df,2)
[  user  value
 0    A    0.3
 1    A    0.4
 2    B    0.5,
   user  value
 3    C    0.6
 4    C    0.7
 5    C    0.8]

split_fun(df,3)

[  user  value
 0    A    0.3
 1    A    0.4,
   user  value
 2    B    0.5,
   user  value
 3    C    0.6
 4    C    0.7
 5    C    0.8]

Upvotes: 1

sophocles

Reputation: 13821

Will this suffice?

df_grouped = df.groupby('user')
# one chunk per user, so each user lands in exactly one chunk
df_list = [group for user, group in df_grouped]

Out[1352]: 
[  user  value
 0    A    0.3
 1    A    0.4,
   user  value
 2    B    0.5,
   user  value
 3    C    0.6
 4    C    0.7
 5    C    0.8]

This runs relatively fast on my machine:

>>> df.shape
(7200000, 2)

>>> print(end - start)
0.532534122467041
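where start and end come from a plain timer around the comprehension, along these lines:

import time

start = time.time()
df_list = [group for user, group in df.groupby('user')]
end = time.time()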

Upvotes: 1
