Serge Kashlik

Reputation: 413

How to efficiently combine dataframe rows based on conditions?

I have the following dataset, which contains the cluster number, the number of observations in that cluster, and the maximum value of another variable x within that cluster.

import numpy as np
import pandas as pd

clust = np.arange(0, 10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame(np.c_[clust, obs, x_max], columns=['clust', 'obs', 'x_max'])

   clust    obs  x_max
0      0   1041     10
1      1    544     20
2      2    310     30
3      3   1648     40
4      4   1862     50
5      5   2120     60
6      6   2916     70
7      7   5148     80
8      8  12733     90
9      9      1    100

My task is to combine clusters with adjacent rows so that each resulting cluster contains at least 1000 observations. My current attempt gets stuck in an infinite loop because the last cluster has only 1 observation.

condition = True
while condition:
    condition = False
    for i in np.arange(0, len(df)):
        if df.loc[i, 'obs'] < 1000:
            # merge the small cluster into the next one
            df.loc[i, 'clust'] = df.loc[i, 'clust'] + 1
            df = df.groupby('clust', as_index=False).agg({'obs': 'sum', 'x_max': 'max'})
            condition = True
            break

Is there perhaps a more efficient way of doing this? I come from a SAS background, where such situations would be solved with the if last.row condition, but it seems there is no such condition in Python.
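As an aside, the effect of SAS's last. flag can be emulated in a plain Python loop by comparing the position against the frame length (a minimal sketch on a toy frame, not the solution itself):

```python
import pandas as pd

# toy frame; the real df also has clust and x_max columns
df_demo = pd.DataFrame({'obs': [1041, 544, 1]})

for i, obs in enumerate(df_demo['obs']):
    is_last = i == len(df_demo) - 1  # True only on the final row, like SAS last.
    print(i, obs, is_last)
```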
The resulting table should look like this

clust obs     x_max
0     1041    10
1     2502    40
2     1862    50
3     2120    60
4     2916    70
5     5148    80
6     12734   100

Upvotes: 1

Views: 59

Answers (1)

Ben.T

Reputation: 29635

Here is another way. A fully vectorized solution is difficult here, but looping over an array (or a list) is much faster than calling loc at each iteration. It is also bad practice to modify df inside the loop; that can only cause problems.

# define variables
s = 0   # running sum of observations in the current group
gr = [] # final group label for each row
i = 0   # current group index

# loop over observations from an array
for obs in df['obs'].to_numpy():
    s += obs
    gr.append(i)
    # close the group once it has at least 1000 observations
    if s >= 1000:
        s = 0
        i += 1
# if the last group is not big enough, merge it into the previous one
if s != 0:
    gr = [i - 1 if val == i else val for val in gr]
    
# now create your new df
new_df = (
    df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
      .reset_index().rename(columns={'index': 'cluster'})
)
print(new_df)
#    cluster    obs  x_max
# 0        0   1041     10
# 1        1   2502     40
# 2        2   1862     50
# 3        3   2120     60
# 4        4   2916     70
# 5        5   5148     80
# 6        6  12734    100

Upvotes: 1
