Reputation: 3196
I have a dataframe like the following one:
ID URL seconds
1 Email 9
1 Email 3
1 App 5
1 App 9
1 Faceboook 50
1 Faceboook 7
1 Faceboook 39
1 Faceboook 10
1 Email 39
1 Email 5
1 Email 57
1 Faceboook 7
1 Faceboook 32
1 Faceboook 3
2 App 11
2 App 10
2 Email 56
2 Faceboook 9
2 Faceboook 46
2 Faceboook 16
2 Email 21
I want to sum the 'seconds' column for successive views of the same URL by the same ID. That's the result I'm looking for:
ID URL seconds
1 Email 12
1 App 14
1 Faceboook 106
1 Email 101
1 Faceboook 42
2 App 21
2 Email 56
2 Faceboook 71
2 Email 21
df.groupBy(['ID', 'URL']).sum()
would not work in this case as it would sum all cases of the same URL for the same ID, not only the successive ones.
Any ideas?
Upvotes: 1
Views: 67
Reputation: 863166
You can use groupby
by Series
created by compare by ne
column URL
and shifted, last use cumsum
with boolean mask
:
print ((df.URL.ne(df.URL.shift())).cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 3
8 4
9 4
10 4
11 5
12 5
13 5
14 6
15 6
16 7
17 8
18 8
19 8
20 9
Name: URL, dtype: int32
print (df['seconds'].groupby([(df.URL.ne(df.URL.shift())).cumsum(), df.ID, df.URL]).sum())
URL ID URL
1 1 Email 12
2 1 App 14
3 1 Faceboook 106
4 1 Email 101
5 1 Faceboook 42
6 2 App 21
7 2 Email 56
8 2 Faceboook 71
9 2 Email 21
Name: seconds, dtype: int64
print (df['seconds'].groupby([(df.URL.ne(df.URL.shift())).cumsum(), df.ID, df.URL])
.sum()
.reset_index(level=0, drop=True)
.reset_index())
ID URL seconds
0 1 Email 12
1 1 App 14
2 1 Faceboook 106
3 1 Email 101
4 1 Faceboook 42
5 2 App 21
6 2 Email 56
7 2 Faceboook 71
8 2 Email 21
Upvotes: 3