Reputation: 3159
I want to split this DataFrame into a predefined number of chunks of approximately the same size:
import pandas as pd

df = pd.DataFrame({
    "user": ["A", "A", "B", "C", "C", "C"],
    "value": [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})
# user value
# 0 A 0.3
# 1 A 0.4
# 2 B 0.5
# 3 C 0.6
# 4 C 0.7
# 5 C 0.8
The DataFrame is large (millions of rows), so the code should be reasonably efficient. The catch is that each user must be present in only one of the chunks.
E.g. if the number of chunks is 3 then:
# Chunk #1 (DataFrame):
# 0 A 0.3
# 1 A 0.4
# Chunk #2 (DataFrame):
# 2 B 0.5
# Chunk #3 (DataFrame):
# 3 C 0.6
# 4 C 0.7
# 5 C 0.8
The following chunking into 3 pieces would be incorrect, since user C would be present in 2 chunks:
# Chunk #1 (DataFrame):
# 0 A 0.3
# 1 A 0.4
# Chunk #2 (DataFrame):
# 2 B 0.5
# 3 C 0.6
# Chunk #3 (DataFrame):
# 4 C 0.7
# 5 C 0.8
I think a solution that first does a groupby on user and then distributes the groups of the resulting DataFrameGroupBy object across chunks should suffice.
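A rough sketch of that idea (a minimal, assumed implementation; the helper name chunk_by_user and the greedy largest-group-first heuristic are my own choices, not an established recipe):
import pandas as pd

def chunk_by_user(df, n):
    sizes = [0] * n
    parts = [[] for _ in range(n)]
    # Hand out the largest user groups first; each group goes to the
    # currently smallest chunk, keeping chunk sizes approximately equal.
    for _, group in sorted(df.groupby("user"), key=lambda kv: -len(kv[1])):
        i = sizes.index(min(sizes))
        parts[i].append(group)
        sizes[i] += len(group)
    return [pd.concat(p) for p in parts if p]

chunks = chunk_by_user(df, 3)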
Upvotes: 3
Views: 1400
Reputation: 5294
You could turn your user column into a categorical one and use qcut for uniform-height binning. Unfortunately, qcut fails to find unique bin edges for discontinuous distributions, so you might run into issues if one user is over-represented. You could pass duplicates="drop", but then you won't always get the number of bins you requested, as some will be clumped together.
I guess you will have to write some algorithm for proper rebinning; I couldn't find anything ready out of the box.
Here's an example with pd.qcut.
Let's build a dummy dataset:
import numpy as np
import pandas as pd

user = np.random.choice(["A", "B", "C", "D", "E", "F", "G", "H"], 10000)
value = np.random.random(size=user.shape)
df = pd.DataFrame({"user": user, "value": value})
print(df.user.value_counts())
E 1329
C 1281
G 1277
F 1260
H 1231
D 1223
A 1205
B 1194
Name: user, dtype: int64
Assign an integer code to each unique user and bin with qcut:
codes = df.user.astype("category").cat.codes
nbins = 3
df["bin"] = pd.qcut(codes, nbins, labels=False)
df.groupby("user").bin.value_counts()
Let's check the results:
print(df.bin.value_counts())
1 3788
0 3629
2 2583
Name: bin, dtype: int64
print(df.groupby("user").bin.value_counts())
user bin
A 0 1266
B 0 1158
C 0 1205
D 1 1255
E 1 1246
F 1 1287
G 2 1274
H 2 1309
Name: bin, dtype: int64
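To see the duplicates="drop" caveat mentioned above in action, here is a hypothetical skewed dataset (counts invented for illustration) where one user dominates:
skewed = pd.Series(["A"] * 500 + ["B"] * 9000 + ["C"] * 500)
codes = skewed.astype("category").cat.codes
# pd.qcut(codes, 3) raises "ValueError: Bin edges must be unique" here;
# duplicates="drop" avoids the error but clumps bins together instead.
bins = pd.qcut(codes, 3, labels=False, duplicates="drop")
print(bins.nunique())  # 2 bins instead of the 3 requested: A and B share a bin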
Upvotes: 2
Reputation: 75080
You can also try np.split with a condition in place: if the requested number of chunks equals the number of unique users, split at every user boundary; otherwise, split so that the first n users stay in the leading chunk:
import numpy as np

def split_fun(data, n):
    # True when the requested number of chunks equals the number of unique users.
    cond = len(set(data['user'])) == n
    # 1-based integer code per user, in order of appearance.
    f = data['user'].factorize()[0] + 1
    if cond:
        # Split positions: every row index where the user code changes.
        p = np.where(np.diff(f) > 0)[0] + 1
    else:
        # Single split position: where the user code first exceeds n.
        p = np.where(np.diff((f > n).view('i1')) > 0)[0] + 1
    return np.split(data, p)
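To see what the boundary computation does, here is a quick trace with the question's df (my own illustration, reusing the imports above):
f = df['user'].factorize()[0] + 1        # array([1, 1, 2, 3, 3, 3])
print(np.where(np.diff(f) > 0)[0] + 1)   # [2 3]: split before rows 2 and 3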
Sample Runs:
split_fun(df,2)
[ user value
0 A 0.3
1 A 0.4
2 B 0.5,
user value
3 C 0.6
4 C 0.7
5 C 0.8]
split_fun(df,3)
[ user value
0 A 0.3
1 A 0.4,
user value
2 B 0.5,
user value
3 C 0.6
4 C 0.7
5 C 0.8]
Upvotes: 1
Reputation: 13821
Will this suffice?
df_grouped = df.groupby('user')
df_list = [group for user, group in df_grouped]
Out[1352]:
[ user value
0 A 0.3
1 A 0.4,
user value
2 B 0.5,
user value
3 C 0.6
4 C 0.7
5 C 0.8]
This runs relatively fast on my machine:
>>> df.shape
(7200000, 2)
>>> print(end - start)
0.532534122467041
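The start and end variables above aren't defined in the snippet; presumably a minimal harness like this (my assumption, not shown by the author) produced that timing:
import time

start = time.time()
df_list = [group for user, group in df.groupby('user')]
end = time.time()
print(end - start)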
Upvotes: 1