Reputation: 2785
Problem:
I would like to join two tables together using a more complex joining criteria. One table has 3 different possible phone numbers, and the other table has 2 different possible phone numbers. I do not know which is the main phone number for each row. As such, I would like to join based on the criteria that a phone number from the number
, Phone1
, Phone2
columns in the first Table can be found in either the ANI
or DNIS
columns in the second Table.
Sample Data:
One DataFrame Looks like this...
application_uuid number Phone1 Phone2
0 b7754a2e-84be-4aec-a04e-0eba93dca5d8 5196942368 NaN NaN
1 6ca3f0c3-0c83-4ebd-afe3-23977f1c6608 6475219092 NaN NaN
2 3b5a083e-7765-4f27-941d-d2b4cbd6f26a 6476256563 NaN NaN
3 229fee54-437f-4812-abec-7034fcb9a655 None NaN NaN
4 866a2cd2-5628-4e6b-b649-d92e2f0585ce 7092164418 7096391545 7092164977
5 8259410d-8d3d-4381-a0b3-6d6ce67b0917 6476387217 6476387217 6475313526
6 c359b03b-5e5f-4d4e-a5b0-ee37ac90c292 None NaN NaN
7 d70414a9-8fd9-4d1d-a77d-17f06743fd00 7054987969 NaN NaN
8 0452edf9-2d58-4ad5-b1e2-0621ac517104 6136219401 NaN NaN
9 cb3ab85c-fd42-4aff-a9b8-1743565b31e6 None NaN NaN
10 563e3e4d-e59a-4afc-b804-91aa14de919d 7056582202 7056582202 7056584200
11 3dd1df61-a36f-490b-ac15-225a83a21551 None 7096899998 7096899998
12 6bc42df3-e869-4794-a595-e3238ccf5284 5873415009 NaN NaN
13 8bf11117-038f-4d2d-b4c6-9b2c6423d626 6473435642 NaN NaN
14 0a854fe5-af66-40b0-b202-3e9367dc5a75 6478594204 NaN NaN
15 b5884de8-2e0c-4b38-a3fd-7911cf4840b1 7787075288 7787075288 7787075288
16 f74cf212-cff0-48cc-b210-539dcdcccf72 7802676838 7806678567 None
17 9bffe5bf-b5d8-4e74-b4c9-9f1b5b238af3 None NaN NaN
18 dce91c00-a1ea-4111-a6ee-5ff5fd0cfb5f 6476093140 NaN NaN
19 29cd024e-2c51-4682-b274-809c3cfb2b2b None NaN NaN
20 ec55317b-fc20-416a-b26d-e95300f89c79 None NaN NaN
21 b3d00cd8-9d8e-415e-99b1-d8944e7b31e1 None NaN NaN
22 b3328787-edb7-4e08-a76c-370a74135fba None NaN NaN
23 c8baf235-e702-41db-b4f8-8c2bf38109bf None NaN NaN
24 cd9179bc-0594-4d25-9d7f-ddf6671777e2 7802428155 NaN NaN
25 370855c0-b3fa-4d87-8d54-b84d34e7f35f None NaN NaN
26 82244e78-3802-4890-96f6-e5267172f0e9 None NaN NaN
27 c7b0054c-29ac-4c76-bc5d-8cdbc93f5157 7052093358 7055268791 7052093358
28 d90e6e87-f7ef-43e1-9c85-35572fae838c 4039696044 NaN NaN
29 bdd2474f-f4be-402b-8672-d73da90d7066 None NaN NaN
The other DataFrame looks like this...
CALL ID CALL TYPE ANI DNIS TALK TIME
0 615262 Inbound 6479246923 8.557236e+09 00:00:00
1 615263 Inbound 5196519186 8.557236e+09 00:00:00
2 615264 Inbound 7095679350 8.557236e+09 00:00:00
3 615265 Inbound 7095679350 8.557236e+09 00:00:00
4 615266 Inbound 7095679350 8.557236e+09 00:00:00
5 615267 Inbound 7095679350 8.557236e+09 00:00:00
6 615268 Inbound 7095679350 8.557236e+09 00:00:00
7 615269 Inbound 7095679350 8.557236e+09 00:00:00
8 615270 Inbound 7095679350 8.557236e+09 00:00:00
9 615271 Inbound 7095679350 8.557236e+09 00:00:00
10 615272 Inbound 4035634231 8.557236e+09 00:00:00
11 615273 Inbound 7095679350 8.557236e+09 00:00:00
12 615274 Inbound 7095679350 8.557236e+09 00:00:00
13 615275 Inbound 7095679350 8.557236e+09 00:00:00
14 615276 Inbound 7095679350 8.557236e+09 00:00:00
15 615277 Inbound 7095679350 8.557236e+09 00:00:00
16 615278 Inbound 7095679350 8.557236e+09 00:00:00
17 615279 Inbound 9057972416 8.557236e+09 00:00:00
18 615280 Inbound 9057972416 8.557236e+09 00:00:00
19 615281 Inbound 9057972416 8.557236e+09 00:00:00
20 615282 Manual 8557235626 8.005635e+09 00:00:11
21 615283 Inbound 9057972416 8.557236e+09 00:00:00
22 615284 Inbound 4169991603 8.557236e+09 00:00:00
23 615285 Manual 8557235626 4.162977e+09 00:01:05
24 615286 Manual 8557235626 8.002569e+09 00:00:55
25 615287 Inbound 4169967207 8.557236e+09 00:07:48
26 615288 Inbound 4169788047 8.557236e+09 00:01:29
27 615289 Inbound 9057972416 8.557236e+09 00:01:39
28 615290 Inbound 8002568964 8.557236e+09 00:04:21
29 615291 Manual 8557235626 7.059751e+09 00:00:19
My Approach:
My approach was to append the phone numbers within in each row as a single column in a list. Then I created a search function. This way is not practical, inelegant and much too slow.
def f(row):
phone_numbers_59 = phone_data['Number'].tolist()
callid = phone_data['CALL ID'].tolist()
get_callid = []
for i in range(0, len(phone_numbers_59)):
if any([x in phone_numbers_59[i] for x in row['Numbers']]):
get_callid.append(callid[i])
if len(get_callid) > 0:
return get_callid
else:
return "NA"
s = data.apply(f, axis=1)
Upvotes: 0
Views: 601
Reputation: 249394
number
,Phone1
,Phone2
can be found in eitherANI
orDNIS
It's simpler if you take the conditions one at a time (and don't write big for
loops in Python, which as you've seen is slow):
for col in ('ANI', 'DNIS'):
right = df2.set_index(col, drop=False)
df1 = df1.join(right, 'number', rsuffix='_num_'+col)
df1 = df1.join(right, 'Phone1', rsuffix='_p1_'+col)
df1 = df1.join(right, 'Phone2', rsuffix='_p2_'+col)
What this does is to add columns to df1
six times: once for each combination. rsuffix
serves to disambiguate the column names. You may end up with multiple matches (perhaps Phone1
matches ANI
and Phone2
matches DNIS
), in which case it's up to you how to resolve or combine them (perhaps using groupby()
).
Upvotes: 1