Reputation: 1887
I have the following dataframe:
I want to create a new column which counts how many times an IP switches domains.
Input:
domain ip timestamp next_domain next_next_domain
0 Google 101 2020-04-01 23:01:41 Facebook N/A
1 Google 101 2020-04-01 23:01:59 Facebook N/A
2 Google 101 2020-04-02 12:01:41 Facebook N/A
3 Facebook 101 2020-04-02 13:11:33 N/A N/A
4 Facebook 101 2020-04-02 13:11:35 N/A N/A
5 Youtube 103 2020-04-21 13:01:41 Google Facebook
6 Youtube 103 2020-04-21 13:11:46 Google Facebook
7 Youtube 103 2020-04-22 01:01:01 Google Facebook
8 Google 103 2020-04-22 02:11:23 Facebook Youtube
9 Facebook 103 2020-04-23 14:11:13 Youtube N/A
10 Youtube 103 2020-04-23 14:11:55 N/A N/A
I have the following code which gives me the following output:
df['switch'] = df.groupby('ip')['domain'].transform(lambda x: x.shift().ne(x).sum()-1)
Output:
domain ip timestamp next_domain next_next_domain switch_count
0 Google 101 2020-04-01 23:01:41 Facebook N/A 1
1 Google 101 2020-04-01 23:01:59 Facebook N/A 1
2 Google 101 2020-04-02 12:01:41 Facebook N/A 1
3 Facebook 101 2020-04-02 13:11:33 N/A N/A 1
4 Facebook 101 2020-04-02 13:11:35 N/A N/A 1
5 Youtube 103 2020-04-21 13:01:41 Google Facebook 3
6 Youtube 103 2020-04-21 13:11:46 Google Facebook 3
7 Youtube 103 2020-04-22 01:01:01 Google Facebook 3
8 Google 103 2020-04-22 02:11:23 Facebook Youtube 3
9 Facebook 103 2020-04-23 14:11:13 Youtube N/A 3
10 Youtube 103 2020-04-23 14:11:55 N/A N/A 3
IP 101 switches 1 time because it goes from Google -> Facebook. IP 103 switches 3 times because it goes from Youtube -> Google -> Facebook -> Youtube.
I have the following solution using "transform" but it takes a long period of time to run on my data set. Is there a faster and more efficient solution that doesn't use transform or apply?
Upvotes: 1
Views: 91
Reputation: 4407
I think you could calculate the difference beforehand (since anyway you subtract 1). However, this requires the df to be sorted by ip
and timestamp
to ensure it works correctly. By doing this outside of groupby
it should be significantly faster.
df['switch_count'] = df['domain'].shift().ne(df['domain'])
df['switch_count'] = df.groupby('ip')['switch_count'].transform('sum') - 1
domain ip timestamp next_domain next_next_domain switch_count
0 Google 101 2020-04-01 23:01:41 Facebook NaN 1
1 Google 101 2020-04-01 23:01:59 Facebook NaN 1
2 Google 101 2020-04-02 12:01:41 Facebook NaN 1
3 Facebook 101 2020-04-02 13:11:33 NaN NaN 1
4 Facebook 101 2020-04-02 13:11:35 NaN NaN 1
5 Youtube 103 2020-04-21 13:01:41 Google Facebook 3
6 Youtube 103 2020-04-21 13:11:46 Google Facebook 3
7 Youtube 103 2020-04-22 01:01:01 Google Facebook 3
8 Google 103 2020-04-22 02:11:23 Facebook Youtube 3
9 Facebook 103 2020-04-23 14:11:13 Youtube NaN 3
10 Youtube 103 2020-04-23 14:11:55 NaN NaN 3
Upvotes: 1