flamenco

Reputation: 2840

Aggregate values of a column in a pandas df based on multiple filters of a different column - python2.7

Starting pandas df is:

df = pd.DataFrame({
    'event':['caller',  'X',  'y',   'X', 'caller', 'caller', 'z', 'z', 'X',  'X',  'w',  'X',  'y',  'X',  'z',  'caller'],
    'value':['c1',      'x1', 'yy', 'x2', 'c2',     'c3',     'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
})
df
Out[24]: 
     event value
0   caller    c1
1        X    x1
2        y    yy
3        X    x2
4   caller    c2
5   caller    c3
6        z    zz
7        z    zz
8        X    x1
9        X    x2
10       w    ww
11       X    x3
12       y    yy
13       X    x4
14       z    z1
15  caller    c4

The goal is to aggregate all values where event == 'X' into one single value within each section delimited by event == 'caller' (a section may or may not contain any 'X' rows).

Note that:

  1. Rows with event == 'X' may or may not be consecutive within a caller section.
  2. Some caller sections may contain no event == 'X' rows at all.
  3. There may be duplicated rows, such as event == 'z'. These should not be aggregated.

So, to keep it simple: only values where event == 'X', wherever they appear within a caller section, should be aggregated.

The final df (`df_aggr`) should look like this:

df_aggr = pd.DataFrame({
    'event':['caller',  'X',     'y',    'caller', 'caller', 'z', 'z', 'X',           'w',  'y',  'z',  'caller'],
    'value':['c1',      'x1 x2', 'yy',   'c2',     'c3',     'zz', 'zz', 'x1 x2 x3 x4', 'ww', 'yy', 'z1',  'c4']
})
df_aggr
Out[28]: 
     event        value
0   caller           c1
1        X        x1 x2
2        y           yy
3   caller           c2
4   caller           c3
5        z           zz
6        z           zz
7        X  x1 x2 x3 x4
8        w           ww
9        y           yy
10       z           z1
11  caller           c4
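Both answers below hinge on labeling the caller sections first, so it may help to see that step in isolation (a minimal sketch run on a modern pandas; the name `section` is mine, not from either answer):

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z',
              'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller'],
    'value': ['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz',
              'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4'],
})

# Each 'caller' row starts a new section; taking cumsum() of the
# boolean marker turns it into a running section id.
section = (df['event'] == 'caller').cumsum()
print(section.tolist())
# [1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4]
```

Note that section 2 contains only its own 'caller' row and no 'X' rows at all, which is exactly case 2 from the list above.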

It has to work on python2.7 and pandas=0.15.2 .

UPDATE:

-----Solution for pandas=0.15.2------

Based on David's answer, in case someone needs to support pandas==0.15.2, it had to be tweaked to accommodate the differences between versions.

In [36]: df = pd.DataFrame({
    ...:     'event':['caller',  'X',  'y',   'X', 'caller', 'caller', 'z', 'z', 'X',  'X',  'w',  'X',  'y',  'X',  'z',  'caller'],
    ...:     'value':['c1',      'x1', 'yy', 'x2', 'c2',     'c3',     'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
    ...: })
    ...:
    ...: s = (df['event'] == 'caller').cumsum()
    ...: df['value'] = df['value'].where(df['value'].mask(df['event'] == 'X'), df.groupby(['event', s])['value'].transform(' '.join))
    ...: df = df[~((df.duplicated()) & (df['event'] == 'X'))].reset_index(drop=True)
    ...:

In [37]: df
Out[37]:
     event        value
0   caller           c1
1        X        x1 x2
2        y           yy
3   caller           c2
4   caller           c3
5        z           zz
6        z           zz
7        X  x1 x2 x3 x4
8        w           ww
9        y           yy
10       z           z1
11  caller           c4


Upvotes: 2

Views: 99

Answers (2)

David Erickson

Reputation: 16683

  1. Similar to Quang's approach, separate the frame into 'caller' sections using cumsum() (cumsum and cumcount are generally useful for this kind of categorization).
  2. Then, use mask to conditionally replace value with the grouped join where event equals 'X', and leave all other rows untouched.
  3. Finally, drop the duplicated 'X' rows, since the join broadcast the same string to every row of each group.

df = pd.DataFrame({
    'event':['caller',  'X',  'y',   'X', 'caller', 'caller', 'z', 'z', 'X',  'X',  'w',  'X',  'y',  'X',  'z',  'caller'],
    'value':['c1',      'x1', 'yy', 'x2', 'c2',     'c3',     'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
})
s = (df['event'] == 'caller').cumsum()
df['value'] = df['value'].mask(df['event'] == 'X',
                               df.groupby(['event', s])['value'].transform(' '.join))
df = df[~((df.duplicated(keep='first')) & (df['event'] == 'X'))].reset_index(drop=True)
df

Out[1]: 
     event        value
0   caller           c1
1        X        x1 x2
2        y           yy
3   caller           c2
4   caller           c3
5        z           zz
6        z           zz
7        X  x1 x2 x3 x4
8        w           ww
9        y           yy
10       z           z1
11  caller           c4
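The reason step 3 is needed becomes visible if you look at the intermediate result of the transform on its own (a small sketch; the name `joined` is mine):

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z',
              'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller'],
    'value': ['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz',
              'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4'],
})
s = (df['event'] == 'caller').cumsum()

# transform(' '.join) broadcasts each group's joined string back to
# every member row -- so every 'X' row in a section carries the full
# joined value, and all but the first must be dropped afterwards.
joined = df.groupby(['event', s])['value'].transform(' '.join)
print(joined[df['event'] == 'X'].tolist())
# ['x1 x2', 'x1 x2', 'x1 x2 x3 x4', 'x1 x2 x3 x4', 'x1 x2 x3 x4', 'x1 x2 x3 x4']
```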

Upvotes: 1

Quang Hoang

Reputation: 150785

Use cumsum on a check for 'caller' to identify each session, then groupby:

(df.groupby([df.event.eq('caller').cumsum(),
            'event'])['value'].agg(' '.join)
   .reset_index(level=1)
   .reset_index(drop=True)
)

Note: that will concatenate all value entries with the same event within a session, i.e. if you have two z rows, their values will be concatenated too. Throw df.event.ne('x').cumsum() into the groupby to group only consecutive blocks of x:

(df.groupby([df.event.eq('caller').cumsum(),
            df.event.ne('x').cumsum(),
                  'event'
                 ])['value'].agg(' '.join)
   .reset_index(level=-1)
   .reset_index(drop=True)
)

Output:

    event  value
0  caller     c1
1       x     x1
2       y     yy
3  caller     c2
4  caller     c3
5       x  x1 x2
6       z     z1
7  caller     c4
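The ne(...).cumsum() trick used above to isolate consecutive blocks can be seen on its own (a small sketch with a made-up series):

```python
import pandas as pd

events = pd.Series(['X', 'X', 'y', 'X', 'X', 'X'])

# ne('X') is False inside a run of X and True elsewhere; its cumsum
# stays constant within each consecutive run of X, so the two separate
# X runs end up with distinct group keys.
blocks = events.ne('X').cumsum()
print(blocks.tolist())
# [0, 0, 1, 1, 1, 1]
```

Combined with the event column itself as a second groupby key, the intervening 'y' row (which shares block id 1 with the second X run) still lands in its own group.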

Upvotes: 0
