Kasper
Kasper

Reputation: 121

Find unique values between two columns

I have been going through various questions, but haven't found one that fits to this case.

I have two columns with emails. The first column(CollectedE) consists of 32000 and the second column(UndE) consists of 14987.

I need to find all emails in the second column, which does not exist in the first column and output them into a completely new column.

I have tried something like this, but that doesn't work because of two different lengths in the columns.

import pandas as pd
import numpy as np
df = pd.read_csv('data.csv', delimiter=";")

df['is_dup'] = df[['CollectedE', 'UndE']].duplicated()
df['dups'] = df.groupby(['CollectedE', 'UndE']).is_dup.transform(np.sum)
# df outputs:
df['is_dup'] =[![enter image description here][1]][1] df[['CollectedE', 'UndE']].duplicated()
df['dups'] = df.groupby(['CollectedE', 'UndE'])

df

Here is a picture of the two columns, if that helps. But it seems like all other cases are about either remove duplicates in one column, delete rows with the same values, find frequencies or similar.

enter image description here

But I hope you can help. Thank you!

Upvotes: 0

Views: 340

Answers (4)

Sumit Diwakar
Sumit Diwakar

Reputation: 46

Here is something I've implemented. I've utilized right outer join and converted output column in a list and appended it in source dataframe.

#Creating dataframe
df = pd.DataFrame({'col1': ['x', 'y', 'z', 'x1'], 'col2': ['x', 'x2', 'y', np.nan]})

#Applying right join and keeping values which are present in 2nd column only
df2 = pd.merge(df[['col1']], df[['col2']], how = 'outer', left_on = ['col1'], right_on 
= ['col2'], indicator = True)

df2 = df2[df2['_merge'] == 'right_only'][['col2']]

To maintain same length of dataframe, null values are added.

#Creating list and adding it as column in source dataframe
df2_list = df2.append(pd.DataFrame({'col2': [np.nan for x in range(len(df) - 
len(df2))]}))['col2'].to_list()

df['col3'] = df2_list

Output:

df
    col1 col2 col3
0    x    x   x2
1    y   x2  NaN
2    z    y  NaN
3   x1  NaN  NaN

You can convert column of list before as well and extend the list with null values.

Upvotes: 1

Matthew Borish
Matthew Borish

Reputation: 3096

Here is a working example using the index difference method and a merge.

df = pd.DataFrame({'column_a':['cat','dog','bird','fish','zebra','snake'],
               'column_b':['leopard','snake','bird','sloth','elephant','dolphin']})

idx1 = pd.Index(df['column_a'])
idx2 = pd.Index(df['column_b'])

x = pd.Series(idx2.difference(idx1), name='non_matching_values')

df.merge(x, how='left', left_on='column_b', right_on=x.values)

column_a    column_b    non_matching_values
0   cat leopard leopard
1   dog snake   NaN
2   bird    bird    NaN
3   fish    sloth   sloth
4   zebra   elephant    elephant
5   snake   dolphin dolphin

Upvotes: 1

Umar.H
Umar.H

Reputation: 23099

you can use isin which is quite simple with ~ to invert the operation.

df = pd.DataFrame({'CollectedE' : ['[email protected]','[email protected]'],
             'UndE' : ['[email protected]','[email protected]']})

df['new_col'] = df[~df['CollectedE'].isin(df['UndE'])]['UndE']

print(df)
          CollectedE                   UndE                new_col
0      [email protected]          [email protected]                    NaN
1  [email protected]  [email protected]  [email protected]

Upvotes: 1

gtancev
gtancev

Reputation: 253

Maybe pandas.Index.difference can help you.

Upvotes: 1

Related Questions