Reputation: 29
I have a large pandas DataFrame like the one below.
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ("1", "Dixon Street", "Auckland"),
        ("2", "Deep Creek Road", "Wellington"),
        ("3", "Lyon St", "Melbourne"),
        ("4", "Hongmian Street", "Quinxin"),
        ("5", "Kadawatha Road", "Ganemulla"),
    ],
    columns=("ad_no", "street", "city"),
)
And I have a second large pandas DataFrame, shown below.
dfa = pd.DataFrame(
    [
        ("1 Dixon Street", "Auckland"),
        ("2 Deep Creek Road", "Wellington"),
        ("3 Lyon St", "Melbourne"),
        ("4 Hongmian Street", "Quinxin"),
        ("5 Federal Street", "Porac City"),
    ],
    columns=("address", "city"),
)
I want to check whether each street string in df is available in dfa, using the str.contains function. I am particularly interested in the non-matching ones (e.g. Kadawatha Road). Can someone please let me know how to do that? Thanks.
I tried the following code, but it doesn't produce any results.
for a in df['street']:
    dfa[dfa['address'].str.contains(a, case=False)]
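The loop computes a filtered frame on each iteration but never prints or stores it, which is likely why nothing shows up. A minimal sketch (with unmatched as an illustrative name) that collects the streets with no match instead:
# Collect the streets from df that do not appear in any dfa address.
unmatched = [
    street for street in df["street"]
    if not dfa["address"].str.contains(street, case=False).any()
]
print(unmatched)  # for the sample data this should be ['Kadawatha Road']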
Upvotes: 1
Views: 71
Reputation: 159
You could also merge the columns to get the full address including the city and drop the duplicates. Something like:
df['full_add'] = df['ad_no'] + " " + df['street'] + " " + df['city']
dfa['full_add'] = dfa['address'] + " " + dfa['city']
pd.concat([df, dfa]).drop_duplicates('full_add', keep=False)
That will produce:
  ad_no          street        city                     full_add           address
4     5  Kadawatha Road   Ganemulla   5 Kadawatha Road Ganemulla               NaN
4   NaN             NaN  Porac City  5 Federal Street Porac City  5 Federal Street
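If only the streets missing from dfa are wanted, the leftover rows can be narrowed further to those that came from df (their address column is NaN). A small sketch, with leftovers and missing_from_dfa as illustrative names:
leftovers = pd.concat([df, dfa]).drop_duplicates('full_add', keep=False)
# Rows that originated from df have no 'address' value, i.e. streets with
# no counterpart in dfa.
missing_from_dfa = leftovers[leftovers['address'].isna()]
print(missing_from_dfa[['ad_no', 'street', 'city']])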
Upvotes: 0
Reputation: 271
Another solution would be to concatenate the values and use a regular expression.
dfa['address'].str.contains(df['street'].str.cat(sep='|'), regex=True)
But this is not very performant for large data sets.
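One caveat: street names containing regex metacharacters would break the joined pattern, so escaping them first is safer. A minimal sketch (pattern and mask are illustrative names):
import re

# Escape any regex metacharacters in the street names before joining them
# into a single alternation pattern.
pattern = df['street'].map(re.escape).str.cat(sep='|')
mask = dfa['address'].str.contains(pattern, case=False, regex=True)
print(dfa[~mask])  # dfa rows whose address matches none of the streets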
Upvotes: 1
Reputation: 1789
As @LMC mentioned, you can use a string contains method, though this might be slow.
I might add a helper column:
df['is_matched'] = df['street'].apply(lambda x: dfa['address'].str.contains(x).any())
And then use a filter:
not_matched_df = df[~df['is_matched']].drop(columns=['is_matched'])
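For the sample frames above, this should leave only the Kadawatha Road row, since that street does not appear in any dfa address.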
There are some other options/libraries. For example, you could try a fuzzy match to do something similar:
%pip install thefuzz
from thefuzz import process
threshold = 80 # Set a similarity threshold
df['match'] = df['street'].apply(lambda x: process.extractOne(x, dfa['address'], score_cutoff=threshold))
not_matched_df = df[df['match'].isnull()].drop(columns=['match'])
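process.extractOne returns None when no candidate reaches score_cutoff, so the isnull filter should keep exactly the streets without a close enough match in dfa['address'].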
Upvotes: 1