Using vectorization in Pandas when in each row you need to use the whole data to compare

Question

I have the contact data below.

    mobile      email           contact_code index_clone contact_day
0   0972135314  abc@gmail.com   1            0           9/26/2018
1   0972135314  cde@gmail.com   2            1           9/26/2018
2   0943360092  cmt@gmail.com   3            2           9/25/2018
3   0988870766  def@gmail.com   4            3           9/24/2018
4   0972135314  cmt@gmail.com   5            4           9/27/2018

I need to create a new column called "user". Its value comes from contact_code: contacts that share the same mobile or the same email are the same user, and the user value for those contacts is the smallest contact_code among them.

For example, take row 4: its mobile is 0972135314, so it matches rows 0 and 1; its email is cmt@gmail.com, so it matches row 2. The contact_code values of the 3 matched rows are {1, 2, 3}, so the user column of row 4 is 1, the smallest value.

The problem is that I loop through each row to find the matching rows, collect their contact_code values, and assign the smallest one to the user column, so Pandas runs slowly. How can I make this faster? Here is my code.

for idx in data.index:
    # rows at or before idx that share this row's mobile or email
    matches = (data["index_clone"] <= idx) & ((data["mobile"] == data.at[idx, "mobile"]) | (data["email"] == data.at[idx, "email"]))
    data.loc[idx, "user"] = data.loc[matches, "contact_code"].min()

Update: Thanks to w-m and jpp for your answers, but I have another problem. Every row has a contact_day column, which is the day of the contact. For example, user 1 has 3 contact rows (rows 0, 1 and 4). The contact_day of rows 0 and 1 is 09/26/2018, and the contact_day of row 4 is 09/27/2018, so the "turn" column is 1 for rows 0 and 1, and 2 for row 4. How can I get the turn of each contact for a user without looping through every row?

w-m · Accepted Answer

You could separately get the minimum contact_code for each group of mobiles, then for each group of emails:

import numpy as np

min_mobile_cc = df.groupby("mobile")["contact_code"].transform("min")
min_email_cc = df.groupby("email")["contact_code"].transform("min")

For each row in the data, this stores the smallest contact_code in that row's mobile group (and, separately, in its email group). That is either the row's own contact_code or the contact_code of another row in the same group.
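To make the intermediate result concrete, here is a minimal sketch that rebuilds the sample data from the question and prints the two per-row minimums. The DataFrame construction is an assumption for illustration (mobiles are kept as strings so the leading zero survives); only the three columns used in this step are included.

```python
import pandas as pd

# Sample rows from the question (illustrative reconstruction).
df = pd.DataFrame({
    "mobile": ["0972135314", "0972135314", "0943360092", "0988870766", "0972135314"],
    "email": ["abc@gmail.com", "cde@gmail.com", "cmt@gmail.com", "def@gmail.com", "cmt@gmail.com"],
    "contact_code": [1, 2, 3, 4, 5],
})

# transform("min") returns one value per original row, aligned to df.index,
# rather than one value per group.
min_mobile_cc = df.groupby("mobile")["contact_code"].transform("min")
min_email_cc = df.groupby("email")["contact_code"].transform("min")

print(min_mobile_cc.tolist())  # [1, 1, 3, 4, 1]
print(min_email_cc.tolist())   # [1, 2, 3, 4, 3]
```

Row 4 shows why both groupings are needed: its mobile group gives 1, while its email group only gives 3.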

With this preparation you can then take, row by row, the smaller of the mobile-group minimum and the email-group minimum:

df["user"] = np.minimum(min_mobile_cc, min_email_cc)
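As a rough end-to-end check, the sketch below wraps the steps above in a function and compares the result against a row-by-row loop in the spirit of the question's code. The function names `assign_user` and `assign_user_loop` are hypothetical, and the loop uses the row index in place of `index_clone`, which holds the same values in the sample data. The two approaches are only expected to agree when contact_code increases with row order, as it does here.

```python
import numpy as np
import pandas as pd

def assign_user(df):
    """Vectorized: element-wise minimum of the two per-group minimums."""
    min_mobile = df.groupby("mobile")["contact_code"].transform("min")
    min_email = df.groupby("email")["contact_code"].transform("min")
    return np.minimum(min_mobile, min_email)

def assign_user_loop(df):
    """Row-by-row reference, mirroring the loop in the question."""
    users = []
    for idx in df.index:
        earlier = df.index <= idx  # stands in for index_clone <= idx
        same = (df["mobile"] == df.at[idx, "mobile"]) | (df["email"] == df.at[idx, "email"])
        users.append(df.loc[earlier & same, "contact_code"].min())
    return pd.Series(users, index=df.index)

# Sample rows from the question (illustrative reconstruction).
df = pd.DataFrame({
    "mobile": ["0972135314", "0972135314", "0943360092", "0988870766", "0972135314"],
    "email": ["abc@gmail.com", "cde@gmail.com", "cmt@gmail.com", "def@gmail.com", "cmt@gmail.com"],
    "contact_code": [1, 2, 3, 4, 5],
})

df["user"] = assign_user(df)
print(df["user"].tolist())  # [1, 1, 3, 4, 1]
```

The vectorized version runs two groupby passes over the data, whereas the loop rescans the whole frame once per row.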
