Reputation: 751
Finding this one difficult and would like to see if you can help, pls:
I have two df
's, dfA
and dfB
. In each there is a telnum
, file
, and datetime
column, but only dfA
contains a name
column and only dfA
has complete telephone number string values unlike dfB
which sometimes has partially complete telephone numbers up to x
of n
digits, e.g. sometimes the int'l calling code is missing, other times a '0' is present instead of the int'l calling code, and other times neither int'l calling code or the leading '0' is present.
dfA
also contains more data than dfB
(<10 rows), but between the two there are complete timestamp
, file
, and telnum
columns always with a date-time and string value respectively (even though dfB['telnum']
is not fully complete as stated above).
What I want to do is extract the rows from dfA
where dfB['telnum']
matches dfA['telnum']
but because dfB['telnum']
is not always complete then I need to check for matches as a substring of dfA['telnum']
too.
I would like the result to be dfResult
but where the returned results have dfA
values on the left and dfB
on the right so that I can see the different file
and datetime
values.
Any ideas?
EDIT:
I think I need an inner merge, e.g.
pandas.merge(dfA, dfB, on='telnum', how='inner')
However, because dfB['telnum']
isn't always complete telnum
string, the results are not complete. How can I get those that match between the two by checking if dfB['telnum']
is a substring of dfA['telnum']
as well?
Upvotes: 0
Views: 169
Reputation: 6639
Since you have not provided any sample input or output, based on the description, I am making an attempt here. This is a multi fold problem and there can be many approaches depending on how your phone numbers look like.
Let me have two dataframes here (skipping your other columns): dfA
and dfB
.
dfA:
telnum
0 0049123456789
1 00919444454555
2 0092789742893
dfB:
telnum
0 123456789
1 09444454555
First, let's clean up the numbers in dfB
. dfB
can have numbers which are:
00918888888888
08888888888
8888888888
I am stripping off every number that starts with 0. So anything that has an international code 0091
becomes 091
, or has 091
and becomes 91
, and has 0888888888
becomes 888888888
. What this does is, since your dfA has the complete number with ISD code which I assume is off the form 00xxNNNNNNNN
- format, a substring match should be found easily.
dfB['telnum'] = dfB['telnum'].apply(lambda x : x[1:] if x.startswith("0") else x)
dfB now looks like (notice the first occuring 0 has been stripped):
telnum
0 123456789
1 9444454555
Take all the numbers from dfB and convert them to a list:
tempList = dfB['telnum'].tolist()
Now look for the match in the other dataframe:
dfA[dfA.telnum.str.contains('|'.join(tempList))]
You should now get:
telnum
0 0049123456789
1 00919444454555
So you have found those numbers for which you got a match. You can now proceed with these rows and perform the necessary join with other columns or anything you want.
Upvotes: 1
Reputation: 131
Before doing any complicated joins you need to clean your data. I'm not sure how it is in your country, but in my country, if you remove the international prefix, or the leading zeroes, a phone number becomes 9 digits long.
Here's what I suggest you do:
def remove_non_numbers(string):
result = ""
for char in string:
if char.isnumeric():
result += char
return result
# Do this for both dataframes just in case
df["telnum"] = df["telnum"].apply(remove_non_numbers)
def limit9(string):
if len(string) > 9:
return string[-9:]
return string
# Do this for both dataframes
df["telnum"] = df["telnum"].apply(limit9)
dfResult = pd.merge(dfA, dfB, on='telnum', how='inner')
Good luck! Let me know if it works.
Upvotes: 1