flamenco

Reputation: 2840

Aggregate values of a column in a pandas df based on multiple filters of a different column - python2.7

Starting pandas df is:

df = pd.DataFrame({
    'event':['caller',  'X',  'y',   'X', 'caller', 'caller', 'z', 'z', 'X',  'X',  'w',  'X',  'y',  'X',  'z',  'caller'],
    'value':['c1',      'x1', 'yy', 'x2', 'c2',     'c3',     'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
})
df
Out[24]: 
     event value
0   caller    c1
1        X    x1
2        y    yy
3        X    x2
4   caller    c2
5   caller    c3
6        z    zz
7        z    zz
8        X    x1
9        X    x2
10       w    ww
11       X    x3
12       y    yy
13       X    x4
14       z    z1
15  caller    c4

The goal is to aggregate all values where event == 'X' into one single value within each section delimited by event == 'caller' (a section may or may not contain any 'X' rows).

Note that:

  1. Rows with event == 'X' may or may not be consecutive within a caller section.
  2. Some caller sections may contain no event == 'X' rows at all.
  3. There may be duplicated rows, such as event == 'z'. These should not be aggregated.

So, to keep it simple: only values where event == 'X', wherever they appear within a caller section, should be aggregated.

The final df (`df_aggr`) should look like this:

df_aggr = pd.DataFrame({
    'event':['caller',  'X',     'y',    'caller', 'caller', 'z', 'z', 'X',           'w',  'y',  'z',  'caller'],
    'value':['c1',      'x1 x2', 'yy',   'c2',     'c3',     'zz', 'zz', 'x1 x2 x3 x4', 'ww', 'yy', 'z1',  'c4']
})
df_aggr
Out[28]: 
     event        value
0   caller           c1
1        X        x1 x2
2        y           yy
3   caller           c2
4   caller           c3
5        z           zz
6        z           zz
7        X  x1 x2 x3 x4
8        w           ww
9        y           yy
10       z           z1
11  caller           c4
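Both answers below hinge on labeling the caller sections first, so it may help to see that step in isolation (a minimal sketch run on a modern pandas; the name `section` is mine, not from either answer):

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z',
              'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller'],
    'value': ['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz',
              'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4'],
})

# Each 'caller' row starts a new section; taking cumsum() of the
# boolean marker turns it into a running section id.
section = (df['event'] == 'caller').cumsum()
print(section.tolist())
# [1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4]
```

Note that section 2 contains only its own 'caller' row and no 'X' rows at all, which is exactly case 2 from the list above.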

It has to work on python2.7 and pandas=0.15.2 .

UPDATE:

-----Solution for pandas=0.15.2------

Based on David's answer, in case someone needs to support pandas==0.15.2, it had to be tweaked to accommodate the differences between versions.

In [36]: df = pd.DataFrame({
    ...:     'event':['caller',  'X',  'y',   'X', 'caller', 'caller', 'z', 'z', 'X',  'X',  'w',  'X',  'y',  'X',  'z',  'caller'],
    ...:     'value':['c1',      'x1', 'yy', 'x2', 'c2',     'c3',     'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
    ...: })
    ...:
    ...: s = (df['event'] == 'caller').cumsum()
    ...: df['value'] = df['value'].where(df['value'].mask(df['event'] == 'X'), df.groupby(['event', s])['value'].transform(' '.join))
    ...: df = df[~((df.duplicated()) & (df['event'] == 'X'))].reset_index(drop=True)
    ...:

In [37]: df
Out[37]:
     event        value
0   caller           c1
1        X        x1 x2
2        y           yy
3   caller           c2
4   caller           c3
5        z           zz
6        z           zz
7        X  x1 x2 x3 x4
8        w           ww
9        y           yy
10       z           z1
11  caller           c4


Upvotes: 2

Views: 99

Answers (2)

David Erickson

Reputation: 16683

  1. Similar to Quang's approach, separate the frame into 'caller' sections using cumsum() (cumsum and cumcount are generally useful for this kind of categorization).
  2. Then, use mask to conditionally replace value with the grouped join where event equals 'X', and leave all other rows untouched.
  3. Finally, drop the duplicated 'X' rows, since the join broadcast the same string to every row of each group.

df = pd.DataFrame({
    'event':['caller',  'X',  'y',   'X', 'caller', 'caller', 'z', 'z', 'X',  'X',  'w',  'X',  'y',  'X',  'z',  'caller'],
    'value':['c1',      'x1', 'yy', 'x2', 'c2',     'c3',     'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
})
s = (df['event'] == 'caller').cumsum()
df['value'] = df['value'].mask(df['event'] == 'X',
                               df.groupby(['event', s])['value'].transform(' '.join))
df = df[~((df.duplicated(keep='first')) & (df['event'] == 'X'))].reset_index(drop=True)
df

Out[1]: 
     event        value
0   caller           c1
1        X        x1 x2
2        y           yy
3   caller           c2
4   caller           c3
5        z           zz
6        z           zz
7        X  x1 x2 x3 x4
8        w           ww
9        y           yy
10       z           z1
11  caller           c4
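The reason step 3 is needed becomes visible if you look at the intermediate result of the transform on its own (a small sketch; the name `joined` is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z',
              'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller'],
    'value': ['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz',
              'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4'],
})
s = (df['event'] == 'caller').cumsum()

# transform(' '.join) broadcasts each group's joined string back to
# every member row -- so every 'X' row in a section carries the full
# joined value, and all but the first must be dropped afterwards.
joined = df.groupby(['event', s])['value'].transform(' '.join)
print(joined[df['event'] == 'X'].tolist())
# ['x1 x2', 'x1 x2', 'x1 x2 x3 x4', 'x1 x2 x3 x4', 'x1 x2 x3 x4', 'x1 x2 x3 x4']
```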

Upvotes: 1

Quang Hoang

Reputation: 150785

Use cumsum on a check for 'caller' to identify each session, then groupby:

(df.groupby([df.event.eq('caller').cumsum(),
            'event'])['value'].agg(' '.join)
   .reset_index(level=1)
   .reset_index(drop=True)
)

Note: that will concatenate all value entries with the same event within a session, i.e. if you have two z rows, their values will be concatenated too. Throw df.event.ne('x').cumsum() into the groupby to group only consecutive blocks of x:

(df.groupby([df.event.eq('caller').cumsum(),
            df.event.ne('x').cumsum(),
                  'event'
                 ])['value'].agg(' '.join)
   .reset_index(level=-1)
   .reset_index(drop=True)
)

Output:

    event  value
0  caller     c1
1       x     x1
2       y     yy
3  caller     c2
4  caller     c3
5       x  x1 x2
6       z     z1
7  caller     c4
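The ne(...).cumsum() trick used above to isolate consecutive blocks can be seen on its own (a small sketch with a made-up series):

```python
import pandas as pd

events = pd.Series(['X', 'X', 'y', 'X', 'X', 'X'])

# ne('X') is False inside a run of X and True elsewhere; its cumsum
# stays constant within each consecutive run of X, so the two separate
# X runs end up with distinct group keys.
blocks = events.ne('X').cumsum()
print(blocks.tolist())
# [0, 0, 1, 1, 1, 1]
```

Combined with the event column itself as a second groupby key, the intervening 'y' row (which shares block id 1 with the second X run) still lands in its own group.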

Upvotes: 0
