Reputation: 674
Suppose we have a dataframe that looks like this:
  start stop  duration
0     A    B         1
1     B    A         2
2     C    D         2
3     D    C         0
What's the best way to construct a list of: i) start/stop pairs; ii) count of start/stop pairs; iii) avg duration of start/stop pairs? In this case, order should not matter: (A,B)=(B,A).
Desired output: [[start,stop,count,avg duration]]
In this example: [[A,B,2,1.5],[C,D,2,1]]
Upvotes: 7
Views: 2154
Reputation: 1507
In one line, this can also be achieved with:
df.apply(lambda x: pd.concat([x, pd.Series(','.join(sorted(str(v) for v in x[['start', 'stop']])))]), axis=1).groupby([0]).duration.agg(['count', 'mean'])
Result:
     count  mean
0
A,B      2   1.5
C,D      2   1.0
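The same idea may be easier to follow when split into steps. The following is only a sketch, assuming the example frame from the question; the names pair and summary are illustrative and not part of the original answer. It builds an order-independent key for each row, then groups on that key without adding a column to df.

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "start": ["A", "B", "C", "D"],
    "stop": ["B", "A", "D", "C"],
    "duration": [1, 2, 2, 0],
})

# Build an order-independent key per row, e.g. ("B", "A") -> "A,B".
# `pair` is a hypothetical helper name.
pair = df[["start", "stop"]].apply(lambda row: ",".join(sorted(row)), axis=1)

# Group on the key Series directly; df itself is left unchanged.
summary = df.groupby(pair)["duration"].agg(["count", "mean"])
print(summary)
```

Grouping on a separate Series keeps the original columns intact, at the cost of a row-wise apply, which can be slow on large frames.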
Upvotes: 0
Reputation: 402323
Sort the first two columns row-wise (you can do this in place, or create a copy and do the same thing; I've done the former), then groupby and agg:
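import numpy as np

# Sort each row's (start, stop) pair so that (B, A) becomes (A, B).
df[['start', 'stop']] = np.sort(df[['start', 'stop']], axis=1)

# Count the rows and average the duration for each unordered pair.
(df.groupby(['start', 'stop'])
   .duration
   .agg(['count', 'mean'])
   .reset_index()
   .values
   .tolist())
# [['A', 'B', 2, 1.5], ['C', 'D', 2, 1.0]]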
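For the copy-based variant mentioned above, a self-contained sketch might look like the following. The function name pair_stats is hypothetical, and the sample frame is the one from the question; the approach is the same sort, groupby and agg, applied to a copy so the caller's frame is not modified.

```python
import numpy as np
import pandas as pd


def pair_stats(df):
    """Return [[start, stop, count, mean duration], ...] with (A, B) == (B, A)."""
    out = df.copy()  # work on a copy so the input frame is untouched
    # Sort each row's (start, stop) values so both orders share one key.
    out[["start", "stop"]] = np.sort(out[["start", "stop"]].to_numpy(), axis=1)
    return (out.groupby(["start", "stop"])["duration"]
               .agg(["count", "mean"])
               .reset_index()
               .to_numpy()
               .tolist())


# Example data copied from the question.
df = pd.DataFrame({
    "start": ["A", "B", "C", "D"],
    "stop": ["B", "A", "D", "C"],
    "duration": [1, 2, 2, 0],
})

result = pair_stats(df)
print(result)
```

Working on a copy costs some extra memory but avoids surprising side effects when df is reused later.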
Upvotes: 9