Pandas extract rows from df where df['col'] values match df2['col'] values

Question

I'm finding this one difficult and would appreciate some help:

I have two DataFrames, dfA and dfB. Each has telnum, file, and datetime columns, but only dfA has a name column. dfA always holds complete telephone number strings, whereas dfB sometimes holds only part of the number (x of n digits): sometimes the international calling code is missing, sometimes a '0' appears in place of the calling code, and sometimes neither the calling code nor the leading '0' is present.

dfA also contains more data than dfB (<10 rows). In both, the datetime, file, and telnum columns are always populated, with date-time and string values respectively (even though dfB['telnum'] is not always a complete number, as stated above).

What I want to do is extract the rows from dfA where dfB['telnum'] matches dfA['telnum']. Because dfB['telnum'] is not always complete, I also need to check whether it matches as a substring of dfA['telnum'].

I would like the result, dfResult, to have the dfA values on the left and the dfB values on the right so that I can compare the differing file and datetime values.

Any ideas?

EDIT:

I think I need an inner merge, e.g.

pandas.merge(dfA, dfB, on='telnum', how='inner')

However, because dfB['telnum'] isn't always a complete telnum string, the results are incomplete. How can I also get the rows where dfB['telnum'] is a substring of dfA['telnum']?

scr · Accepted Answer

Before doing any complicated joins you need to clean your data. I'm not sure how it is in your country, but in my country, if you remove the international prefix, or the leading zeroes, a phone number becomes 9 digits long.

Here's what I suggest you do:

Clean your 'telnum' columns:

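def remove_non_numbers(string):
    """Return only the numeric characters of a phone-number string."""
    result = ""
    for char in string:
        if char.isnumeric():
            result += char
    return result

# Do this for both dataframes just in case
dfA["telnum"] = dfA["telnum"].apply(remove_non_numbers)
dfB["telnum"] = dfB["telnum"].apply(remove_non_numbers)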
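If you prefer to avoid a Python-level loop, the same cleaning step can probably be done with pandas' vectorized string methods. This is only a sketch: the sample numbers are made up, and the regex `\D` removes anything that is not a decimal digit, which may differ slightly from `isnumeric()` for unusual Unicode characters.

```python
import pandas as pd

# Hypothetical sample data; the real frames come from the question.
df = pd.DataFrame(
    {"telnum": ["+44 (0)20 7946-0018", "020 7946 0018", "79460018"]}
)

# Strip every non-digit character from the whole column at once.
df["telnum"] = df["telnum"].astype(str).str.replace(r"\D", "", regex=True)

print(df["telnum"].tolist())
```

The `astype(str)` call is there in case some numbers were read in as integers rather than strings.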

Remove any digits before the last 9:

def limit9(string):
    if len(string) > 9:
        return string[-9:]
    return string

# Do this for both dataframes
dfA["telnum"] = dfA["telnum"].apply(limit9)
dfB["telnum"] = dfB["telnum"].apply(limit9)
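The truncation step can also be written with a string slice on the column. Again, a sketch with made-up numbers; the value 9 is my assumption about the length of a national number and should be adjusted for your country.

```python
import pandas as pd

# Hypothetical digits-only numbers in the three shapes described
# in the question, plus one that is already shorter than 9 digits.
df = pd.DataFrame(
    {"telnum": ["447911123456", "07911123456", "7911123456", "12345"]}
)

# Keep at most the last 9 digits; shorter strings are left unchanged.
df["telnum"] = df["telnum"].str[-9:]

print(df["telnum"].tolist())
```

All three variants of the same number end up with the same key, which is what lets a plain equality merge work afterwards.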

Now you can try to merge your cleaned DFs:

dfResult = pd.merge(dfA, dfB, on='telnum', how='inner')
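To see the dfA values on the left and the dfB values on the right, as you asked, you could pass `suffixes` to the merge so that the overlapping file and datetime columns stay distinguishable. Here is a small end-to-end sketch; the data, the `clean` helper, and the `_A`/`_B` suffix names are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the frames in the question.
dfA = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "telnum": ["+44 7911 123456", "+44 7700 900123"],
    "file": ["a1.csv", "a2.csv"],
    "datetime": pd.to_datetime(["2021-01-01 10:00", "2021-01-02 11:00"]),
})
dfB = pd.DataFrame({
    "telnum": ["07911 123456", "7700900123", "0000000000"],
    "file": ["b1.csv", "b2.csv", "b3.csv"],
    "datetime": pd.to_datetime(
        ["2021-02-01 09:00", "2021-02-02 09:30", "2021-02-03 08:00"]
    ),
})

def clean(series, keep=9):
    """Drop non-digits, then keep at most the last `keep` digits."""
    return series.astype(str).str.replace(r"\D", "", regex=True).str[-keep:]

dfA["telnum"] = clean(dfA["telnum"])
dfB["telnum"] = clean(dfB["telnum"])

# Overlapping non-key columns get the suffixes: file_A, file_B, ...
dfResult = pd.merge(dfA, dfB, on="telnum", how="inner", suffixes=("_A", "_B"))
print(dfResult)
```

The join key `telnum` appears once, and the dfA columns come before the dfB columns in the result.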

Good luck! Let me know if it works.
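If a fixed 9-digit key does not suit your data, one alternative is the substring check you described: pair every dfA row with every dfB row and keep the pairs where the dfB number is a suffix of the dfA number. This is a rough sketch with invented data, assuming both columns are already digits-only; `how="cross"` needs pandas 1.2 or later, and the `key` column is just a helper name I made up.

```python
import pandas as pd

# Hypothetical digits-only data.
dfA = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "telnum": ["447911123456", "447700900123"],
})
dfB = pd.DataFrame({"telnum": ["07911123456", "7700900123", "5550000"]})

# Drop the leading '0' so a partial number can be a true suffix.
dfB["key"] = dfB["telnum"].str.lstrip("0")

# Every dfA row paired with every dfB row (len(dfA) * len(dfB) rows).
pairs = dfA.merge(dfB, how="cross", suffixes=("_A", "_B"))

# Keep pairs where the partial number ends the full number;
# an empty key would match everything, so it is excluded.
mask = [
    bool(b) and a.endswith(b)
    for a, b in zip(pairs["telnum_A"], pairs["key"])
]
dfResult = pairs[mask].drop(columns="key").reset_index(drop=True)
print(dfResult)
```

The cross join grows with the product of the two row counts, so this is only sensible when at least one frame is small. Very short partial numbers may also match more than one full number.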
