Viet Pm
Viet Pm

Reputation: 317

Using vectorization in Pandas when in each row you need to use the whole data to compare

I have a contact data below.

    mobile      email           contact_code index_clone contact_day
0   0972135314  [email protected]   1            0           9/26/2018
1   0972135314  [email protected]   2            1           9/26/2018
2   0943360092  [email protected]   3            2           9/25/2018
3   0988870766  [email protected]   4            3           9/24/2018
4   0972135314  [email protected]   5            4           9/27/2018

I need to create a new column called "user". User column get data from contact code, if contacts have the same mobile or the same email, they are the same user, and the user value for these contacts is the smallest contact_code.

For example: Row 4: mobile is 0972135314 so it matches row 0 and 1, email is [email protected] so it matches row 2. The contact_code for all 3 matched row is {1,2,3}, so user column of row 4 is 1, the smallest value.

The problem is, I need to loop through each row to find matched rows, get the list of contact_code and assign the smallest value to user column. So Pandas runs slowly. How can I make the process faster ? Here my code.

for idx in data.index:
    data.set_value(idx,"user",min(data.loc[(data["index_clone"] <= idx) & ((data["mobile"] == data.get_value(idx,"mobile")) | (data["email"] == data.get_value(idx,"email"))),"contact_code"]))

*Update: Thank w-m and jpp for your answers. But I have another problem. Every row have a contact_day column, it's the day of the contact. For example, there are 3 contacts row of user 1 (row 0, 1 and 4). Row 0's contact_day and row 1's contact day is 09/26/2018, and row 4's contact_day is 09/27/2018. So row 0 and 1 "turn" column is 1, and row 4 "turn" is 2. How can I get the turn of the contact of the user without looping through every row ?

Upvotes: 3

Views: 107

Answers (2)

w-m
w-m

Reputation: 11232

You could separately get the minimum contact_code for each group of mobiles, then for each group of emails:

min_mobile_cc = df.groupby("mobile").contact_code.transform(np.min)
min_email_cc = df.groupby("email").contact_code.transform(np.min)

For each row in the data, this stores either their original contact_code, or the minimum contact code of any of the other rows in the same mobile/email group.

With this preparation you can then choose the global minimum contact code between mobile and email groups:

df["user"] = np.minimum(min_mobile_cc, min_email_cc)

Upvotes: 2

jpp
jpp

Reputation: 164783

One way is to sort your dataframe by descending contact_code and create a couple of dictionary mappings. Then use these mappings to derive the correct contact_code.

This works because during dictionary construction values for keys are overwritten by later assignments. You are only interested in the minimum mappings, which is applied via our initial sorting.

df_rev = df.sort_values('contact_code', ascending=False)
d1 = df_rev.set_index('mobile')['contact_code'].to_dict()
d2 = df_rev.set_index('email')['contact_code'].to_dict()

df['contact_code_new'] = np.minimum(df['mobile'].map(d1), df['email'].map(d2))

print(df)

      mobile          email  contact_code  index_clone  contact_code_new
0  972135314  [email protected]             1            0                 1
1  972135314  [email protected]             2            1                 1
2  943360092  [email protected]             3            2                 3
3  988870766  [email protected]             4            3                 4
4  972135314  [email protected]             5            4                 1

Upvotes: 0

Related Questions