Reputation: 1387
This question is a continuation of one I asked earlier:
Now I have data something like this:
Sno User Cookie
1 1 A
2 1 A
3 1 A
4 1 B
5 1 C
6 1 D
7 1 A
8 1 B
9 1 D
10 1 E
11 1 D
12 1 A
13 2 F
14 2 G
15 2 F
16 2 G
17 2 H
18 2 H
So let's say we have 5 cookies for user 1: A, B, C, D, E. Now I want to count whether any cookie reoccurred after a new cookie was encountered. For example, in the data above, cookie A is encountered again at position 7 and then again at position 12. NOTE: we wouldn't count the A at position 2, since it appears consecutively with no new cookie in between; but at positions 7 and 12 we had seen several new cookies before seeing A again, so we count those instances. This is what I get if I run the code mentioned in my previous post:
For User 1
Sno Cookie Count
1 A 2
2 B 1
3 C 0
4 D 2
5 E 0
For User 2
Sno Cookie Count
6 F 1
7 G 1
8 H 0
Now comes the tricky part: from the counts we know that for user 1, three cookies (A, B and D) reoccurred, and for user 2, two cookies (F and G) reoccurred. I want to aggregate these results like this:
Sno User Reoccurred_Instances
1 1 3
2 2 2
Is there an easier way to get this result without using a loop?
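For reference, the sample data above can be built as a pandas DataFrame like this (a sketch; df is the name the answers below assume):

```python
import pandas as pd

# Sketch: the question's sample data as a DataFrame
# (column names exactly as shown above).
df = pd.DataFrame({
    'Sno': range(1, 19),
    'User': [1] * 12 + [2] * 6,
    'Cookie': list('AAABCDABDEDA') + list('FGFGHH'),
})
```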
Upvotes: 1
Views: 66
Reputation: 374
Another approach to this, which I think should be pretty flexible:
import numpy as np  # df is the DataFrame from the question

# Rows whose (User, Cookie) pair has been seen before
dups = df.loc[:, ['User', 'Cookie']].duplicated()
# Rows whose Cookie differs from the immediately preceding row
diffs = df.Cookie != df.Cookie.shift()
# A reoccurrence: a repeated pair that is not a consecutive repeat
flags = np.logical_and(dups, diffs)
df['flag'] = flags
result_1 = df.groupby(['User', 'Cookie'])['flag'].agg([('Count', sum)])
result_2 = result_1.groupby('User')['Count'].agg([('Reoccurred_Instances', lambda x: (x > 0).sum())])
It takes duplicates in terms of User and Cookie, then compares Cookie values to their neighbors, and finally appends a column of True/False corresponding to the two aforementioned flags. That column is used to create the two summary tables you mentioned (result_1 and result_2, below).
Count
User Cookie
1 A 2.0
B 1.0
C 0.0
D 2.0
E 0.0
2 F 1.0
G 1.0
H 0.0
Reoccurred_Instances
User
1 3.0
2 2.0
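Put together as a self-contained sketch (the sample data is rebuilt inline, and plain sum/rename calls stand in for the tuple-list agg syntax, which newer pandas versions may warn about), the same counts come out:

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample data inline (a sketch).
df = pd.DataFrame({
    'User': [1] * 12 + [2] * 6,
    'Cookie': list('AAABCDABDEDA') + list('FGFGHH'),
})

# Flag rows whose (User, Cookie) pair was seen before AND whose
# Cookie differs from the immediately preceding row.
dups = df[['User', 'Cookie']].duplicated()
diffs = df.Cookie != df.Cookie.shift()
df['flag'] = np.logical_and(dups, diffs)

# Per-(User, Cookie) reoccurrence counts, then the per-User number
# of cookies that reoccurred at least once.
result_1 = df.groupby(['User', 'Cookie'])['flag'].sum().rename('Count')
result_2 = (result_1 > 0).groupby('User').sum().rename('Reoccurred_Instances')
```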
EDIT: As mentioned in a comment below, let's assume a third user with Cookie values H, H, H, J, J, J. Like this:
Sno User Cookie
0 1 1 A
1 2 1 A
2 3 1 A
3 4 1 B
4 5 1 C
5 6 1 D
6 7 1 A
7 8 1 B
8 9 1 D
9 10 1 E
10 11 1 D
11 12 1 A
12 13 2 F
13 14 2 G
14 15 2 F
15 16 2 G
16 17 2 H
17 18 2 H
18 19 3 H
19 20 3 H
20 21 3 H
21 22 3 J
22 23 3 J
23 24 3 J
Running this through the code above, we get the following result dataframes:
Count
User Cookie
1 A 2.0
B 1.0
C 0.0
D 2.0
E 0.0
2 F 1.0
G 1.0
H 0.0
3 H 0.0
J 0.0
And:
Reoccurred_Instances
User
1 3.0
2 2.0
3 0.0
The way the code is set up, it'll still show the third user (who does not have any repeat Cookie values) with a Reoccurred_Instances value of 0.
Upvotes: 1
Reputation: 51335
Following the same first steps as I took in my answer to your previous question, get rid of consecutive Cookie values and find the duplicates:
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()  # drop consecutive repeats (.copy() avoids a SettingWithCopyWarning)
no_doubles['dups'] = no_doubles.Cookie.duplicated()     # mark cookies seen earlier in the filtered data
Then use a groupby to group by User on the subset of data that are indeed duplicated (no_doubles[no_doubles['dups']]), and find the number of unique Cookies for each user using nunique:
no_doubles[no_doubles['dups']].groupby('User')['Cookie'].nunique().reset_index()
This returns:
User Cookie
0 1 3
1 2 2
You can rename the columns as desired.
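For instance, a hedged sketch (rebuilding the sample data inline; the final column name is just the one the question asks for):

```python
import pandas as pd

# The question's sample data (a sketch).
df = pd.DataFrame({
    'User': [1] * 12 + [2] * 6,
    'Cookie': list('AAABCDABDEDA') + list('FGFGHH'),
})

# Drop consecutive repeats, mark reoccurrences, count unique
# reoccurring cookies per user, and rename the output column.
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()
no_doubles['dups'] = no_doubles.Cookie.duplicated()
out = (no_doubles[no_doubles['dups']]
       .groupby('User')['Cookie'].nunique()
       .reset_index()
       .rename(columns={'Cookie': 'Reoccurred_Instances'}))
```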
[EDIT]:
To deal with different cases, you can just add to this logic. For example, consider the following dataframe with no repeats for User number 3:
Sno User Cookie
1 1 A
2 1 A
3 1 A
4 1 B
5 1 C
6 1 D
7 1 A
8 1 B
9 1 D
10 1 E
11 1 D
12 1 A
13 2 F
14 2 G
15 2 F
16 2 G
17 2 H
18 2 H
19 3 H
20 3 I
21 3 J
You can do:
no_doubles = df[(df.Cookie != df.Cookie.shift()) | (df.User != df.User.shift())].copy()  # .copy() avoids a SettingWithCopyWarning
no_doubles['dups'] = no_doubles.duplicated(['Cookie', 'User'])
no_doubles.groupby('User').apply(lambda x: x[x.dups]['Cookie'].nunique()).to_frame('Reoccurred_Instances')
To get:
Reoccurred_Instances
User
1 3
2 2
3 0
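As a runnable check of the edited logic (three-user sample rebuilt inline; .copy() is added here only to avoid a SettingWithCopyWarning):

```python
import pandas as pd

# Three-user sample data from the edit above (a sketch).
df = pd.DataFrame({
    'User': [1] * 12 + [2] * 6 + [3] * 3,
    'Cookie': list('AAABCDABDEDA') + list('FGFGHH') + list('HIJ'),
})

# Drop consecutive repeats per user, then count each user's
# distinct cookies that reoccur non-consecutively.
no_doubles = df[(df.Cookie != df.Cookie.shift()) |
                (df.User != df.User.shift())].copy()
no_doubles['dups'] = no_doubles.duplicated(['Cookie', 'User'])
result = (no_doubles.groupby('User')
          .apply(lambda x: x[x.dups]['Cookie'].nunique())
          .to_frame('Reoccurred_Instances'))
```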
Upvotes: 1