Reputation: 2840
Starting pandas df is:
df = pd.DataFrame({
'event':['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z', 'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller'],
'value':['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
})
df
Out[24]:
event value
0 caller c1
1 X x1
2 y yy
3 X x2
4 caller c2
5 caller c3
6 z zz
7 z zz
8 X x1
9 X x2
10 w ww
11 X x3
12 y yy
13 X x4
14 z z1
15 caller c4
The goal is to aggregate all values where event == 'X' into one single value, for each section delimited by event == 'caller' (such X rows might be present in a section, or not!).
Note that:
- event == 'X' rows may or may not be consecutive within a caller section;
- event == 'X' may be entirely absent from some caller sections;
- event == 'z' rows should not be aggregated.
Thus, to keep it simple, only values where event == 'X', which may appear anywhere within a caller section, should be aggregated.
The final df (df_aggr) should look like this:
df_aggr = pd.DataFrame({
'event':['caller', 'X', 'y', 'caller', 'caller', 'z', 'z', 'X', 'w', 'y', 'z', 'caller'],
'value':['c1', 'x1 x2', 'yy', 'c2', 'c3', 'zz', 'zz', 'x1 x2 x3 x4', 'ww', 'yy', 'z1', 'c4']
})
df_aggr
Out[28]:
event value
0 caller c1
1 X x1 x2
2 y yy
3 caller c2
4 caller c3
5 z zz
6 z zz
7 X x1 x2 x3 x4
8 w ww
9 y yy
10 z z1
11 caller c4
It has to work on Python 2.7 and pandas==0.15.2.
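For reference, the desired transformation can be sketched in plain Python, independent of any pandas version (the variable names here are purely illustrative, not part of the question):

```python
# Reference sketch of the desired aggregation: label each caller section,
# pool the 'X' values per section, then keep only the first 'X' row of
# each section, carrying the pooled (space-joined) value.
events = ['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z',
          'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller']
values = ['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz',
          'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']

# 1) label sections: running count of 'caller' rows seen so far
section = []
n = 0
for e in events:
    n += (e == 'caller')
    section.append(n)

# 2) pool the X values per section
pooled = {}
for e, v, s in zip(events, values, section):
    if e == 'X':
        pooled.setdefault(s, []).append(v)

# 3) emit rows, keeping only the first X row per section
out_events, out_values, seen = [], [], set()
for e, v, s in zip(events, values, section):
    if e == 'X':
        if s in seen:
            continue
        seen.add(s)
        v = ' '.join(pooled[s])
    out_events.append(e)
    out_values.append(v)

print(list(zip(out_events, out_values)))
```

The output matches df_aggr above row for row; the pandas answers below express the same three steps vectorized.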
UPDATE:
----- Solution for pandas==0.15.2 -----
Based on David's answer, in case someone needs to support pandas==0.15.2: the code had to be tweaked a little to accommodate the differences between versions.
In [36]: df = pd.DataFrame({
...: 'event':['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z', 'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller'],
...: 'value':['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
...: })
...:
...: s = (df['event'] == 'caller').cumsum()
...: df['value'] = df['value'].where(df['value'].mask(df['event'] == 'X'), df.groupby(['event', s])['value'].transform(' '.join))
...: df = df[~((df.duplicated()) & (df['event'] == 'X'))].reset_index(drop=True)
...:
In [37]: df
Out[37]:
event value
0 caller c1
1 X x1 x2
2 y yy
3 caller c2
4 caller c3
5 z zz
6 z zz
7 X x1 x2 x3 x4
8 w ww
9 y yy
10 z z1
11 caller c4
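The tweak wraps mask inside where, presumably because Series.mask in pandas 0.15.2 only accepted the condition (blanking values out) and not yet a replacement argument. In plain Python, the two steps amount to the following (a sketch on the first caller section only, with the joined string precomputed):

```python
# Step 1 -- mask(event == 'X'): blank out the values of the X rows.
# Step 2 -- where(..., joined): keep the non-blank values and substitute
# the joined string wherever a value was blanked out.
events = ['caller', 'X', 'y', 'X']
values = ['c1', 'x1', 'yy', 'x2']
joined = 'x1 x2'  # what groupby(...).transform(' '.join) produces here
masked = [None if e == 'X' else v for e, v in zip(events, values)]
result = [joined if v is None else v for v in masked]
print(result)  # ['c1', 'x1 x2', 'yy', 'x1 x2']
```

The duplicated X rows left behind ('x1 x2' twice) are then removed by the filter on the last line of the solution.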
Upvotes: 2
Views: 99
Reputation: 16683
cumsum and cumcount are ever useful for categorization:
- Use cumsum() on event == 'caller' to label each caller section.
- Use mask to conditionally join the grouped value where event equals 'X' -- otherwise do nothing.
- Finally filter out the duplicated X rows, as the extra rows whose values were joined have not been removed yet.
df = pd.DataFrame({
'event':['caller', 'X', 'y', 'X', 'caller', 'caller', 'z', 'z', 'X', 'X', 'w', 'X', 'y', 'X', 'z', 'caller'],
'value':['c1', 'x1', 'yy', 'x2', 'c2', 'c3', 'zz', 'zz', 'x1', 'x2', 'ww', 'x3', 'yy', 'x4', 'z1', 'c4']
})
s = (df['event'] == 'caller').cumsum()
df['value'] = df['value'].mask(df['event'] == 'X',
df.groupby(['event', s])['value'].transform(' '.join))
df = df[~((df.duplicated(keep='first')) & (df['event'] == 'X'))].reset_index(drop=True)
df
Out[1]:
event value
0 caller c1
1 X x1 x2
2 y yy
3 caller c2
4 caller c3
5 z zz
6 z zz
7 X x1 x2 x3 x4
8 w ww
9 y yy
10 z z1
11 caller c4
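After the join, every X row within a section carries the identical joined string, which is exactly what makes the duplicated() filter work. In plain Python, the final step is roughly (rows here are illustrative):

```python
# Keep only the first of the now-identical X rows; leave every other
# event untouched (duplicates such as the two 'z zz' rows must survive,
# which is why the filter also checks event == 'X').
rows = [('caller', 'c1'), ('X', 'x1 x2'), ('y', 'yy'), ('X', 'x1 x2'),
        ('z', 'zz'), ('z', 'zz')]
seen, kept = set(), []
for ev, val in rows:
    if ev == 'X' and (ev, val) in seen:
        continue
    seen.add((ev, val))
    kept.append((ev, val))
print(kept)
```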
Upvotes: 1
Reputation: 150785
Use cumsum on a check for 'caller' to identify the sessions, then groupby:
(df.groupby([df.event.eq('caller').cumsum(),
'event'])['value'].agg(' '.join)
.reset_index(level=1)
.reset_index(drop=True)
)
Note that this will concatenate all values with the same event within a session, i.e. if you have two z rows, the two values for those rows will be concatenated. Throw df.event.ne('x').cumsum() into the groupby to group only the consecutive blocks of x:
(df.groupby([df.event.eq('caller').cumsum(),
df.event.ne('x').cumsum(),
'event'
])['value'].agg(' '.join)
.reset_index(level=-1)
.reset_index(drop=True)
)
Output:
event value
0 caller c1
1 x x1
2 y yy
3 caller c2
4 caller c3
5 x x1 x2
6 z z1
7 caller c4
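How df.event.ne('x').cumsum() separates the blocks can be traced in plain Python (note this answer uses lowercase 'x', while the question's data uses uppercase 'X'; the events list below is illustrative): the counter advances on every non-x row, so a run of consecutive x rows shares one label while separated runs get distinct labels.

```python
# Emulate df.event.ne('x').cumsum(): running count of non-'x' rows.
events = ['caller', 'x', 'y', 'x', 'caller', 'caller', 'x', 'x', 'z', 'caller']
labels = []
n = 0
for e in events:
    n += (e != 'x')
    labels.append(n)
print(labels)  # [1, 1, 2, 2, 3, 4, 4, 4, 5, 6]
```

Here the two x rows at positions 6 and 7 share label 4 and are joined, while the x rows at positions 1 and 3 carry different labels (1 and 2) and stay separate.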
Upvotes: 0