Reputation: 169

Comparing two lists, one of them can change (depending tha DataFrame) and add a new column with the results

The lists ( 2 o more) to compare with the DataFrame (column A) can change depending tha data of the column B

    df = pd.DataFrame({'A': [['10', '20', '30', '40'],['50', '60', '70', '80'],['90','100','110','120'],'B': ['X','Y','Z']})
    
    findA = ['10','25', '20','35']
    findB = ['60','85]
        .
        .
        .
    findX
    
    
                      A  B
0      [10, 20, 30, 40]  X
1      [50, 60, 70, 80]  Y
2   [90, 100, 110, 120]  Z

This will be the desired behavior

If the data of column B is 'X' then compare findA list with column A and add the diference (findA) to the column C

If the data of column B is 'Y' then compare findB list with column A and add the diference (findB) to the column C

else NaN

                      A  B              C
0      [10, 20, 30, 40]  X    ['25','35']
1      [50, 60, 70, 80]  Y         ['85']
2   [90, 100, 110, 120]  Z            NaN

Thanks in advance

Upvotes: 0

Answers (2)

jezrael

Reputation: 862761

For improve performance is filtered rows and processing only matched values by masks, then is created new column by DataFrame.loc with difference of sets:

m1 = df['B'].eq('X')
df.loc[m1, 'C'] =  df.loc[m1, 'A'].apply(lambda x: list(set(findA) - set(x)))
m2 = df['B'].eq('Y')
df.loc[m2, 'C'] =  df.loc[m2, 'A'].apply(lambda x: list(set(findB) - set(x)))

print (df)
                     A  B         C
0     [10, 20, 30, 40]  X  [35, 25]
1     [50, 60, 70, 80]  Y      [85]
2  [90, 100, 110, 120]  Z       NaN

More dynamic solution:

d = {'X': findA, 'Y': findB}
   
for k, v in d.items():      
    m = df['B'].eq(k)
    df.loc[m, 'C'] =  df.loc[m, 'A'].apply(lambda x: list(set(v) - set(x)))

print (df)
                     A  B         C
0     [10, 20, 30, 40]  X  [35, 25]
1     [50, 60, 70, 80]  Y      [85]
2  [90, 100, 110, 120]  Z       NaN

Solution with numpy.select is possible, but here apply processing all values, so if large data it should be slowier:

m1 = df['B'].eq('X')
m2 = df['B'].eq('Y')
a =  df['A'].apply(lambda x: list(set(findA) - set(x)))
b =  df['A'].apply(lambda x: list(set(findB) - set(x)))

df['C'] = np.select([m1, m2], [a, b], np.nan)
print (df)
                     A  B         C
0     [10, 20, 30, 40]  X  [35, 25]
1     [50, 60, 70, 80]  Y      [85]
2  [90, 100, 110, 120]  Z       NaN

Upvotes: 1

Serge Ballesta

Reputation: 148965

You could first build a temporary dataframe to link the findx lists to the values of column B and merge it into df:

tmp = pd.DataFrame({'B': ['X', 'Y', 'Z'], 'find': [findA, findB, findC]})

df['C'] = df.merge(tmp, on='B').apply(lambda x: list(set(x['find']) - set(x['A'])), axis=1)

With a reasonable findC it gives:

                     A  B         C
0     [10, 20, 30, 40]  X  [35, 25]
1     [50, 60, 70, 80]  Y      [85]
2  [90, 100, 110, 120]  Z        []

If you really need a NaN instead of the empty list, just do:

df.loc[df['C'].apply(len) == 0, 'C'] = np.nan

Upvotes: 1

Comparing two lists, one of them can change (depending tha DataFrame) and add a new column with the results

Answers (2)

Related Questions