More efficient way to use transform in pandas

Question

I have the following dataframe:

I want to create a new column which counts how many times an IP switches domains.

Input:

    domain      ip      timestamp              next_domain    next_next_domain
0   Google      101     2020-04-01 23:01:41    Facebook       N/A
1   Google      101     2020-04-01 23:01:59    Facebook       N/A
2   Google      101     2020-04-02 12:01:41    Facebook       N/A
3   Facebook    101     2020-04-02 13:11:33    N/A            N/A
4   Facebook    101     2020-04-02 13:11:35    N/A            N/A
5   Youtube     103     2020-04-21 13:01:41    Google         Facebook
6   Youtube     103     2020-04-21 13:11:46    Google         Facebook
7   Youtube     103     2020-04-22 01:01:01    Google         Facebook
8   Google      103     2020-04-22 02:11:23    Facebook       Youtube
9   Facebook    103     2020-04-23 14:11:13    Youtube        N/A
10  Youtube     103     2020-04-23 14:11:55    N/A            N/A

I have the following code which gives me the following output:

df['switch'] = df.groupby('ip')['domain'].transform(lambda x: x.shift().ne(x).sum()-1)

Output:

    domain      ip      timestamp              next_domain    next_next_domain  switch_count
0   Google      101     2020-04-01 23:01:41    Facebook       N/A               1
1   Google      101     2020-04-01 23:01:59    Facebook       N/A               1
2   Google      101     2020-04-02 12:01:41    Facebook       N/A               1
3   Facebook    101     2020-04-02 13:11:33    N/A            N/A               1
4   Facebook    101     2020-04-02 13:11:35    N/A            N/A               1
5   Youtube     103     2020-04-21 13:01:41    Google         Facebook          3
6   Youtube     103     2020-04-21 13:11:46    Google         Facebook          3
7   Youtube     103     2020-04-22 01:01:01    Google         Facebook          3
8   Google      103     2020-04-22 02:11:23    Facebook       Youtube           3
9   Facebook    103     2020-04-23 14:11:13    Youtube        N/A               3
10  Youtube     103     2020-04-23 14:11:55    N/A            N/A               3

IP 101 switches 1 time because it goes from Google -> Facebook. IP 103 switches 3 times because it goes from Youtube -> Google -> Facebook -> Youtube.

I have the following solution using "transform" but it takes a long period of time to run on my data set. Is there a faster and more efficient solution that doesn't use transform or apply?

More efficient way to use transform in pandas

Answers (1)

Related Questions