Reputation: 121
I have been going through various questions, but haven't found one that fits to this case.
I have two columns with emails. The first column(CollectedE) consists of 32000 and the second column(UndE) consists of 14987.
I need to find all emails in the second column, which does not exist in the first column and output them into a completely new column.
I have tried something like this, but that doesn't work because of two different lengths in the columns.
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv', delimiter=";")
df['is_dup'] = df[['CollectedE', 'UndE']].duplicated()
df['dups'] = df.groupby(['CollectedE', 'UndE']).is_dup.transform(np.sum)
# df outputs:
df['is_dup'] =[![enter image description here][1]][1] df[['CollectedE', 'UndE']].duplicated()
df['dups'] = df.groupby(['CollectedE', 'UndE'])
df
Here is a picture of the two columns, if that helps. But it seems like all other cases are about either remove duplicates in one column, delete rows with the same values, find frequencies or similar.
But I hope you can help. Thank you!
Upvotes: 0
Views: 340
Reputation: 46
Here is something I've implemented. I've utilized right outer join and converted output column in a list and appended it in source dataframe.
#Creating dataframe
df = pd.DataFrame({'col1': ['x', 'y', 'z', 'x1'], 'col2': ['x', 'x2', 'y', np.nan]})
#Applying right join and keeping values which are present in 2nd column only
df2 = pd.merge(df[['col1']], df[['col2']], how = 'outer', left_on = ['col1'], right_on
= ['col2'], indicator = True)
df2 = df2[df2['_merge'] == 'right_only'][['col2']]
To maintain same length of dataframe, null values are added.
#Creating list and adding it as column in source dataframe
df2_list = df2.append(pd.DataFrame({'col2': [np.nan for x in range(len(df) -
len(df2))]}))['col2'].to_list()
df['col3'] = df2_list
Output:
df
col1 col2 col3
0 x x x2
1 y x2 NaN
2 z y NaN
3 x1 NaN NaN
You can convert column of list before as well and extend the list with null values.
Upvotes: 1
Reputation: 3096
Here is a working example using the index difference method and a merge.
df = pd.DataFrame({'column_a':['cat','dog','bird','fish','zebra','snake'],
'column_b':['leopard','snake','bird','sloth','elephant','dolphin']})
idx1 = pd.Index(df['column_a'])
idx2 = pd.Index(df['column_b'])
x = pd.Series(idx2.difference(idx1), name='non_matching_values')
df.merge(x, how='left', left_on='column_b', right_on=x.values)
column_a column_b non_matching_values
0 cat leopard leopard
1 dog snake NaN
2 bird bird NaN
3 fish sloth sloth
4 zebra elephant elephant
5 snake dolphin dolphin
Upvotes: 1
Reputation: 23099
you can use isin
which is quite simple with ~
to invert the operation.
df = pd.DataFrame({'CollectedE' : ['[email protected]','[email protected]'],
'UndE' : ['[email protected]','[email protected]']})
df['new_col'] = df[~df['CollectedE'].isin(df['UndE'])]['UndE']
print(df)
CollectedE UndE new_col
0 [email protected] [email protected] NaN
1 [email protected] [email protected] [email protected]
Upvotes: 1