Find str.contains in two large Pandas DataFrames

I have a large pandas DataFrames like below.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ("1", "Dixon Street", "Auckland"),
        ("2", "Deep Creek Road", "Wellington"),
        ("3", "Lyon St", "Melbourne"),
        ("4", "Hongmian Street", "Quinxin"),
        ("5", "Kadawatha Road", "Ganemulla"),
    ],
    
    columns=("ad_no", "street", "city"),
)

And I have a second large pandas DataFrame as below.

dfa = pd.DataFrame(
    [
        ("1 Dixon Street", "Auckland"),
        ("2 Deep Creek Road", "Wellington"),
        ("3 Lyon St", "Melbourne"),
        ("4 Hongmian Street", "Quinxin"),
        ("5 Federal Street", "Porac City"),
    ],
    
    columns=("address", "city"),
)

I want to check street string in df is available in dfa using str.contains function. I am particularly interested in not matching ones (e.g, Kadawatha Road)Can someone please let me know how to do that? Thanks

I tried the following code. But, it doesn't provide any results.

for a in df['street']:
  dfa[dfa['address'].str.contains(a, case=False)]

Upvotes: 1

Answers (3)

AlexVI

Reputation: 159

you could also merge the columns to get the full adress inlcuding the city and drop the duplicates.
something like:

df['full_add'] = df['ad_no'] + " " + df['street'] + " "+df['city']
dfa['full_add'] = dfa['address'] + " " +dfa['city']
pd.concat([df,dfa]).drop_duplicates('full_add', keep=False)

that will produce:

ad_no   street  city    full_add    address
4   5   Kadawatha Road  Ganemulla   5 Kadawatha Road Ganemulla  NaN
4   NaN NaN Porac City  5 Federal Street Porac City 5 Federal Street

Upvotes: 0

Jens

Reputation: 271

Another solution would be to concat the values and use regular expression.

dfa['address'].str.contains(df['street'].str.cat(sep='|'), regex=True)

But this is not very performant for large data sets.

Upvotes: 1

Docuemada

Reputation: 1789

As @LMC mentioned, you can use a string contains method, though this might be slow.

I might add a helper column

df['is_matched'] = df['street'].apply(lambda x: dfa['address'].str.contains(x).any())

And then use a filter

not_matched_df = df[~df['is_matched']].drop(columns=['is_matched'])

There are some other options/libraries. For example you could try a fuzzy match to do something similar:

%pip install thefuzz
from thefuzz import process
threshold = 80  # Set a similarity threshold
df['match'] = df['street'].apply(lambda x: process.extractOne(x, dfa['address'], score_cutoff=threshold))
not_matched_df = df[df['match'].isnull()].drop(columns=['match'])

Upvotes: 1

Find str.contains in two large Pandas DataFrames

Answers (3)

Related Questions