Pandas DataFrame Apply Efficiency

Question

I have a DataFrame to which I want to add a status column indicating whether each value has a match in another DataFrame. My current code works:

df1['NewColumn'] = df1['ComparisonColumn'].apply(lambda x: 'Match' if any(df2.ComparisonColumn == x) else ('' if x is None else 'Missing'))

I know the line is ugly, and I get the impression that it's inefficient. Can you suggest a better way to make this comparison?

Scott Boston · Accepted Answer

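You can use np.where, isin, and isnull. These operate on the whole column at once, instead of scanning df2 once per row as the apply call does.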

Create some dummy data:

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'ComparisonColumn': np.random.randint(10, 20, 20)})
df.iloc[4] = np.nan  # Create missing data
df2 = pd.DataFrame({'ComparisonColumn': np.random.randint(15, 30, 20)})

Do matching with np.where:

df['NewColumn'] = np.where(df.ComparisonColumn.isin(df2.ComparisonColumn), 'Matched', np.where(df.ComparisonColumn.isnull(), 'Missing', ''))

Output:

    ComparisonColumn NewColumn
0               12.0          
1               12.0          
2               16.0   Matched
3               11.0          
4                NaN   Missing
5               19.0   Matched
6               16.0   Matched
7               11.0          
8               10.0          
9               11.0          
10              19.0   Matched
11              10.0          
12              10.0          
13              19.0   Matched
14              13.0          
15              14.0          
16              10.0          
17              10.0          
18              14.0          
19              11.0
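
Note that the labels above ('Matched' for a match, 'Missing' for a null, '' otherwise) are not the ones in the question, which uses 'Match' for a match, '' for a null, and 'Missing' for everything else. If you need the question's labels, one possible sketch uses np.select, which takes a list of conditions and returns the choice for the first condition that is true. The sample values below are made up for illustration, and treating both None and NaN as "null" via isna is an assumption about what the question intends.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df1 = pd.DataFrame({'ComparisonColumn': [12, 16, np.nan, 19, 11]})
df2 = pd.DataFrame({'ComparisonColumn': [16, 19, 25]})

col = df1['ComparisonColumn']

# Conditions are checked in order, so null values are labelled ''
# before the membership test is considered.
conditions = [col.isna(), col.isin(df2['ComparisonColumn'])]
choices = ['', 'Match']

# Rows that are neither null nor present in df2 fall through to the default.
df1['NewColumn'] = np.select(conditions, choices, default='Missing')

print(df1)
```

Checking isna first keeps a null in df1 from being reported as a match when df2 also contains nulls, since isin treats NaN as equal to NaN.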
