Python/Pandas: How to do a join on matches across multiple columns

Question

Problem:

I would like to join two tables together using a more complex joining criteria. One table has 3 different possible phone numbers, and the other table has 2 different possible phone numbers. I do not know which is the main phone number for each row. As such, I would like to join based on the criteria that a phone number from the number, Phone1, Phone2 columns in the first Table can be found in either the ANI or DNIS columns in the second Table.

Sample Data:

One DataFrame Looks like this...

                           application_uuid      number      Phone1      Phone2
0      b7754a2e-84be-4aec-a04e-0eba93dca5d8  5196942368         NaN         NaN
1      6ca3f0c3-0c83-4ebd-afe3-23977f1c6608  6475219092         NaN         NaN
2      3b5a083e-7765-4f27-941d-d2b4cbd6f26a  6476256563         NaN         NaN
3      229fee54-437f-4812-abec-7034fcb9a655        None         NaN         NaN
4      866a2cd2-5628-4e6b-b649-d92e2f0585ce  7092164418  7096391545  7092164977
5      8259410d-8d3d-4381-a0b3-6d6ce67b0917  6476387217  6476387217  6475313526
6      c359b03b-5e5f-4d4e-a5b0-ee37ac90c292        None         NaN         NaN
7      d70414a9-8fd9-4d1d-a77d-17f06743fd00  7054987969         NaN         NaN
8      0452edf9-2d58-4ad5-b1e2-0621ac517104  6136219401         NaN         NaN
9      cb3ab85c-fd42-4aff-a9b8-1743565b31e6        None         NaN         NaN
10     563e3e4d-e59a-4afc-b804-91aa14de919d  7056582202  7056582202  7056584200
11     3dd1df61-a36f-490b-ac15-225a83a21551        None  7096899998  7096899998
12     6bc42df3-e869-4794-a595-e3238ccf5284  5873415009         NaN         NaN
13     8bf11117-038f-4d2d-b4c6-9b2c6423d626  6473435642         NaN         NaN
14     0a854fe5-af66-40b0-b202-3e9367dc5a75  6478594204         NaN         NaN
15     b5884de8-2e0c-4b38-a3fd-7911cf4840b1  7787075288  7787075288  7787075288
16     f74cf212-cff0-48cc-b210-539dcdcccf72  7802676838  7806678567        None
17     9bffe5bf-b5d8-4e74-b4c9-9f1b5b238af3        None         NaN         NaN
18     dce91c00-a1ea-4111-a6ee-5ff5fd0cfb5f  6476093140         NaN         NaN
19     29cd024e-2c51-4682-b274-809c3cfb2b2b        None         NaN         NaN
20     ec55317b-fc20-416a-b26d-e95300f89c79        None         NaN         NaN
21     b3d00cd8-9d8e-415e-99b1-d8944e7b31e1        None         NaN         NaN
22     b3328787-edb7-4e08-a76c-370a74135fba        None         NaN         NaN
23     c8baf235-e702-41db-b4f8-8c2bf38109bf        None         NaN         NaN
24     cd9179bc-0594-4d25-9d7f-ddf6671777e2  7802428155         NaN         NaN
25     370855c0-b3fa-4d87-8d54-b84d34e7f35f        None         NaN         NaN
26     82244e78-3802-4890-96f6-e5267172f0e9        None         NaN         NaN
27     c7b0054c-29ac-4c76-bc5d-8cdbc93f5157  7052093358  7055268791  7052093358
28     d90e6e87-f7ef-43e1-9c85-35572fae838c  4039696044         NaN         NaN
29     bdd2474f-f4be-402b-8672-d73da90d7066        None         NaN         NaN

The other DataFrame looks like this...

        CALL ID CALL TYPE         ANI          DNIS TALK TIME
0        615262   Inbound  6479246923  8.557236e+09  00:00:00
1        615263   Inbound  5196519186  8.557236e+09  00:00:00
2        615264   Inbound  7095679350  8.557236e+09  00:00:00
3        615265   Inbound  7095679350  8.557236e+09  00:00:00
4        615266   Inbound  7095679350  8.557236e+09  00:00:00
5        615267   Inbound  7095679350  8.557236e+09  00:00:00
6        615268   Inbound  7095679350  8.557236e+09  00:00:00
7        615269   Inbound  7095679350  8.557236e+09  00:00:00
8        615270   Inbound  7095679350  8.557236e+09  00:00:00
9        615271   Inbound  7095679350  8.557236e+09  00:00:00
10       615272   Inbound  4035634231  8.557236e+09  00:00:00
11       615273   Inbound  7095679350  8.557236e+09  00:00:00
12       615274   Inbound  7095679350  8.557236e+09  00:00:00
13       615275   Inbound  7095679350  8.557236e+09  00:00:00
14       615276   Inbound  7095679350  8.557236e+09  00:00:00
15       615277   Inbound  7095679350  8.557236e+09  00:00:00
16       615278   Inbound  7095679350  8.557236e+09  00:00:00
17       615279   Inbound  9057972416  8.557236e+09  00:00:00
18       615280   Inbound  9057972416  8.557236e+09  00:00:00
19       615281   Inbound  9057972416  8.557236e+09  00:00:00
20       615282    Manual  8557235626  8.005635e+09  00:00:11
21       615283   Inbound  9057972416  8.557236e+09  00:00:00
22       615284   Inbound  4169991603  8.557236e+09  00:00:00
23       615285    Manual  8557235626  4.162977e+09  00:01:05
24       615286    Manual  8557235626  8.002569e+09  00:00:55
25       615287   Inbound  4169967207  8.557236e+09  00:07:48
26       615288   Inbound  4169788047  8.557236e+09  00:01:29
27       615289   Inbound  9057972416  8.557236e+09  00:01:39
28       615290   Inbound  8002568964  8.557236e+09  00:04:21
29       615291    Manual  8557235626  7.059751e+09  00:00:19

My Approach:

My approach was to append the phone numbers within in each row as a single column in a list. Then I created a search function. This way is not practical, inelegant and much too slow.

def f(row):
    phone_numbers_59 = phone_data['Number'].tolist()
    callid = phone_data['CALL ID'].tolist()

    get_callid = []
    for i in range(0, len(phone_numbers_59)):
        if any([x in phone_numbers_59[i] for x in row['Numbers']]):
            get_callid.append(callid[i])

    if len(get_callid) > 0:
        return get_callid
    else:
        return "NA"

s = data.apply(f, axis=1)

John Zwinck · Accepted Answer

number, Phone1, Phone2 can be found in either ANI or DNIS

It's simpler if you take the conditions one at a time (and don't write big for loops in Python, which as you've seen is slow):

for col in ('ANI', 'DNIS'):
    right = df2.set_index(col, drop=False)
    df1 = df1.join(right, 'number', rsuffix='_num_'+col)
    df1 = df1.join(right, 'Phone1', rsuffix='_p1_'+col)
    df1 = df1.join(right, 'Phone2', rsuffix='_p2_'+col)

What this does is to add columns to df1 six times: once for each combination. rsuffix serves to disambiguate the column names. You may end up with multiple matches (perhaps Phone1 matches ANI and Phone2 matches DNIS), in which case it's up to you how to resolve or combine them (perhaps using groupby()).

Python/Pandas: How to do a join on matches across multiple columns

Answers (1)

Related Questions