Sakeer
Sakeer

Reputation: 2006

compare multiple column value together using pandas

I know i can do like below if we are checking only two columns together.

df['flag'] = df['a_id'].isin(df['b_id'])

where df is a data frame, and a_id and b_id are two columns of the data frame. It will return True or False value based on the match. But i need to compare multiple columns together.

For example: if there are a_id , a_region, a_ip, b_id, b_region and b_ip columns. I want to compare like below,

enter image description here

a_key = df['a_id'] + df['a_region] + df['a_ip']
b_key = df['b_id'] + df['b_region] + df['b_ip']

df['flag'] = a_key.isin(b_key)

Somehow the above code is always returning False value. The output should be like below,

enter image description here

First row flag will be True because there is a match.

a_key becomes 2a10 this is match with last row of b_key (2a10)

Upvotes: 6

Views: 8586

Answers (3)

Rajat Jain
Rajat Jain

Reputation: 2022

You were going in the right direction, just use:

a_key = df['a_id'].astype(str) + df['a_region'] + df['a_ip'].astype(str)
b_key = df['b_id'].astype(str) + df['b_region'] + df['b_ip'].astype(str)

a_key.isin(b_key)

Mine is giving below results:

0     True
1    False
2    False

Upvotes: 9

somiandras
somiandras

Reputation: 227

You can use isin with DataFrame as value, but as per the docs:

If values is a DataFrame, then both the index and column labels must match

So this should work:

# Removing the prefixes from column names
df_a = df[['a_id', 'a_region', 'a_ip']].rename(columns=lambda x: x[2:])
df_b = df[['b_id', 'b_region', 'b_ip']].rename(columns=lambda x: x[2:])

# Find rows where all values are in the other
matched = df_a.isin(df_b).all(axis=1)

# Get actual rows with boolean indexing
df_a.loc[matched]

# ... or add boolean flag to dataframe
df['flag'] = matched

Upvotes: 3

Chris Adams
Chris Adams

Reputation: 18647

Here's one approach using DataFrame.merge, pandas.concat and testing for duplicated values:

df_merged = df.merge(df,
                     left_on=['a_id', 'a_region', 'a_ip'],
                     right_on=['b_id', 'b_region', 'b_ip'],
                     suffixes=('', '_y'))

df['flag'] = pd.concat([df, df_merged[df.columns]]).duplicated(keep=False)[:len(df)].values

[out]

    a_id a_region    a_ip     b_id b_region   b_ip   flag
0      2        a      10  3222222    sssss  22222   True
1  22222    bcccc   10000    43333    ddddd  11111  False
2  33333    acccc  120000        2        a     10  False

Upvotes: 2

Related Questions