Reputation: 13
I'm new to Python pandas. I'm having trouble finding the difference between two lists stored in the columns of a pandas DataFrame.
Example input (with ; as the column separator):
ColA; ColB
A,B,C,D; B,C,D
A,C,E,F; A,C,F
Expected output:
ColA; ColB; ColC
A,B,C,D; B,C,D; A
A,C,E,F; A,C,F; E
What I want to do is similar to:
df['ColC'] = np.setdiff1d(df['ColA'].str.split(','), df['ColB'].str.split(','))
But it returns an error:
raise ValueError('Length of values does not match length of index',data,index,len(data),len(index))
Kindly advise.
Upvotes: 1
Views: 1609
Reputation: 729
A much faster way to do this is a simple set subtraction:
import pandas as pd
# Creating a DataFrame
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
# Finding the difference
df['ColC'] = df['ColA'].map(set) - df['ColB'].map(set)
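As the number of rows in the DataFrame grows, row-by-row operations such as apply with axis=1 become computationally expensive. Note that ColC then holds sets rather than lists, so the original element order is not kept.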
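For the comma-separated strings shown in the question, the same set-subtraction idea could be sketched as follows. This is only a sketch: it assumes each cell is a plain string such as 'A,B,C,D', and sorting before re-joining is an added choice, since sets have no order.

```python
import pandas as pd

# Sample data from the question: each cell is a comma-separated string
df = pd.DataFrame({
    'ColA': ['A,B,C,D', 'A,C,E,F'],
    'ColB': ['B,C,D', 'A,C,F'],
})

# Split each string into a list, then turn each list into a set
set_a = df['ColA'].str.split(',').map(set)
set_b = df['ColB'].str.split(',').map(set)

# Element-wise set difference; sort and re-join because sets are unordered
df['ColC'] = (set_a - set_b).map(lambda s: ','.join(sorted(s)))

print(df)
```

Here ColC ends up as a string column again, matching the layout of the expected output in the question.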
Upvotes: 0
Reputation: 2913
You can apply a lambda function on the DataFrame to find the difference like this:
import pandas as pd
# creating DataFrame (can also be loaded from a file)
df = pd.DataFrame([[['A','B','C','D'], ['B','C']]], columns=['ColA','ColB'])
# apply a lambda function to get the difference
df['ColC'] = df[['ColA','ColB']].apply(lambda x: [i for i in x['ColA'] if i not in x['ColB']], axis=1)
Please note: this finds the asymmetric difference ColA - ColB, i.e. the items in ColA that are not in ColB.
Result (ColC for the single sample row):
['A', 'D']
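If the inputs are comma-separated strings as in the question and the order of ColA should be preserved, one possible variant is a list comprehension over both columns instead of apply. This is a sketch under those assumptions; the helper name ordered_diff is hypothetical and not part of pandas.

```python
import pandas as pd

# Sample data from the question: each cell is a comma-separated string
df = pd.DataFrame({
    'ColA': ['A,B,C,D', 'A,C,E,F'],
    'ColB': ['B,C,D', 'A,C,F'],
})

def ordered_diff(a, b):
    """Return the items of a that are not in b, keeping a's order."""
    exclude = set(b.split(','))  # set lookup keeps the membership test cheap
    return ','.join(item for item in a.split(',') if item not in exclude)

# Pair up the two columns row by row and compute ColA - ColB for each pair
df['ColC'] = [ordered_diff(a, b) for a, b in zip(df['ColA'], df['ColB'])]

print(df)
```

Iterating over zip of the two columns avoids building a Series object for every row, which apply with axis=1 does.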
Upvotes: 2