Reputation: 413
I have the following dataset, with columns for the cluster number, the number of observations in that cluster, and the maximum value of another variable x within that cluster.
import numpy as np
import pandas as pd

clust = np.arange(0, 10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame(np.c_[clust, obs, x_max], columns=['clust', 'obs', 'x_max'])
clust obs x_max
0 0 1041 10
1 1 544 20
2 2 310 30
3 3 1648 40
4 4 1862 50
5 5 2120 60
6 6 2916 70
7 7 5148 80
8 8 12733 90
9 9 1 100
My task is to combine the clust
row values with adjacent rows, so that each cluster contains at least 1000 observations.
My current attempt gets stuck in an infinite loop because the last cluster has only 1 observation.
condition = True
while condition:
    condition = False
    for i in range(len(df)):
        if df.loc[i, 'obs'] < 1000:
            # merge this cluster into the next one by bumping its label
            df.loc[i, 'clust'] = df.loc[i, 'clust'] + 1
            df = df.groupby('clust', as_index=False).agg({'obs': 'sum', 'x_max': 'max'})
            condition = True
            break
Is there perhaps a more efficient way of doing this? I come from a SAS background, where such situations would be solved with the if last.row
condition, but it seems there is no such condition in Python.
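(For reference, the closest Python analogue to SAS's last.row flag is simply comparing the loop index to the final index; a minimal sketch with illustrative variable names:)

```python
import pandas as pd

df = pd.DataFrame({'obs': [1041, 544, 310]})

for i in range(len(df)):
    is_last = (i == len(df) - 1)  # Python analogue of SAS's last.row flag
    if is_last:
        # special handling for the final row goes here
        print(f"row {i} is the last row")
```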
The resulting table should look like this:
clust obs x_max
0 1041 10
1 2502 40
2 1862 50
3 2120 60
4 2916 70
5 5148 80
6 12734 100
Upvotes: 1
Views: 59
Reputation: 29635
Here is another way. A fully vectorized solution is difficult to implement here, but using a for loop over an array (or a list) will be faster than using loc
at each iteration. Also, it is not good practice to change df
within the loop; it can only cause problems.
# define variables
s = 0    # running sum of observations in the current group
gr = []  # final grouping labels
i = 0    # current group index

# loop over observations from an array
for n in df['obs'].to_numpy():
    s += n
    gr.append(i)
    # check that the size of the group is big enough
    if s > 1000:
        s = 0
        i += 1

# condition to deal with last rows if last group not big enough
if s != 0:
    gr = [i - 1 if val == i else val for val in gr]

# now create your new df
new_df = (
    df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
      .reset_index().rename(columns={'index': 'cluster'})
)
print(new_df)
print(new_df)
# cluster obs x_max
# 0 0 1041 10
# 1 1 2502 40
# 2 2 1862 50
# 3 3 2120 60
# 4 4 2916 70
# 5 5 5148 80
# 6 6 12734 100
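The loop above can also be wrapped into a reusable helper; here is a sketch (the function name and the `min_obs` parameter are mine, not part of the answer, and I use `>= min_obs` so a group summing to exactly the threshold counts as "at least" that size):

```python
import numpy as np
import pandas as pd

def merge_small_clusters(df, min_obs=1000):
    """Assign consecutive group labels so each group holds at least
    min_obs observations; a too-small trailing group is merged into
    the previous one."""
    s, i, gr = 0, 0, []
    for n in df['obs'].to_numpy():
        s += n
        gr.append(i)
        if s >= min_obs:  # group is big enough: start a new one
            s, i = 0, i + 1
    if s != 0:            # trailing group too small: merge backwards
        gr = [i - 1 if g == i else g for g in gr]
    return (df.groupby(gr)
              .agg({'obs': 'sum', 'x_max': 'max'})
              .reset_index()
              .rename(columns={'index': 'cluster'}))

clust = np.arange(10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame({'clust': clust, 'obs': obs, 'x_max': x_max})
print(merge_small_clusters(df))
```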
Upvotes: 1