Eisen
Eisen

Reputation: 1887

More efficient way to use transform in pandas

I have the following dataframe:

I want to create a new column which counts how many times an IP switches domains.

Input:

    domain      ip      timestamp              next_domain    next_next_domain
0   Google      101     2020-04-01 23:01:41    Facebook       N/A
1   Google      101     2020-04-01 23:01:59    Facebook       N/A
2   Google      101     2020-04-02 12:01:41    Facebook       N/A
3   Facebook    101     2020-04-02 13:11:33    N/A            N/A
4   Facebook    101     2020-04-02 13:11:35    N/A            N/A
5   Youtube     103     2020-04-21 13:01:41    Google         Facebook
6   Youtube     103     2020-04-21 13:11:46    Google         Facebook
7   Youtube     103     2020-04-22 01:01:01    Google         Facebook
8   Google      103     2020-04-22 02:11:23    Facebook       Youtube
9   Facebook    103     2020-04-23 14:11:13    Youtube        N/A
10  Youtube     103     2020-04-23 14:11:55    N/A            N/A

I have the following code which gives me the following output:

df['switch'] = df.groupby('ip')['domain'].transform(lambda x: x.shift().ne(x).sum()-1)

Output:

    domain      ip      timestamp              next_domain    next_next_domain  switch_count
0   Google      101     2020-04-01 23:01:41    Facebook       N/A               1
1   Google      101     2020-04-01 23:01:59    Facebook       N/A               1
2   Google      101     2020-04-02 12:01:41    Facebook       N/A               1
3   Facebook    101     2020-04-02 13:11:33    N/A            N/A               1
4   Facebook    101     2020-04-02 13:11:35    N/A            N/A               1
5   Youtube     103     2020-04-21 13:01:41    Google         Facebook          3
6   Youtube     103     2020-04-21 13:11:46    Google         Facebook          3
7   Youtube     103     2020-04-22 01:01:01    Google         Facebook          3
8   Google      103     2020-04-22 02:11:23    Facebook       Youtube           3
9   Facebook    103     2020-04-23 14:11:13    Youtube        N/A               3
10  Youtube     103     2020-04-23 14:11:55    N/A            N/A               3

IP 101 switches 1 time because it goes from Google -> Facebook. IP 103 switches 3 times because it goes from Youtube -> Google -> Facebook -> Youtube.

I have the following solution using "transform" but it takes a long period of time to run on my data set. Is there a faster and more efficient solution that doesn't use transform or apply?

Upvotes: 1

Views: 91

Answers (1)

user2246849
user2246849

Reputation: 4407

I think you could calculate the difference beforehand (since anyway you subtract 1). However, this requires the df to be sorted by ip and timestamp to ensure it works correctly. By doing this outside of groupby it should be significantly faster.

df['switch_count'] = df['domain'].shift().ne(df['domain'])
df['switch_count'] = df.groupby('ip')['switch_count'].transform('sum') - 1
      domain   ip            timestamp next_domain next_next_domain  switch_count
0     Google  101  2020-04-01 23:01:41    Facebook              NaN             1
1     Google  101  2020-04-01 23:01:59    Facebook              NaN             1
2     Google  101  2020-04-02 12:01:41    Facebook              NaN             1
3   Facebook  101  2020-04-02 13:11:33         NaN              NaN             1
4   Facebook  101  2020-04-02 13:11:35         NaN              NaN             1
5    Youtube  103  2020-04-21 13:01:41      Google         Facebook             3
6    Youtube  103  2020-04-21 13:11:46      Google         Facebook             3
7    Youtube  103  2020-04-22 01:01:01      Google         Facebook             3
8     Google  103  2020-04-22 02:11:23    Facebook          Youtube             3
9   Facebook  103  2020-04-23 14:11:13     Youtube              NaN             3
10   Youtube  103  2020-04-23 14:11:55         NaN              NaN             3

Upvotes: 1

Related Questions