Stphn

Reputation: 117

Checking if a list in a pandas DataFrame contains a specific value while ignoring NaNs

Let's assume my DataFrame's second column contains lists of integers (or NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({"col_1": [1, 2, 3, 4, 5], "col_2": [[1, 2], np.nan, [3, 5, 9], [2], [8, 5]], "col_3": np.nan})

Output:

   col_1      col_2  col_3
0      1     [1, 2]    NaN
1      2        NaN    NaN
2      3  [3, 5, 9]    NaN
3      4        [2]    NaN
4      5     [8, 5]    NaN

I'd like to insert a 1 in column 3 if the int in column 1 can be found in the list of ints in column 2:

   col_1      col_2  col_3
0      1     [1, 2]    1
1      2        NaN    NaN
2      3  [3, 5, 9]    1
3      4        [2]    NaN
4      5     [8, 5]    1

I tried to solve it like this:

for i in range(0,len(df)):
    if df["col_1"][i] in df["col_2"][i]:
        df["col_3"][i]=1

This gave me TypeError: argument of type 'float' is not iterable because of the NaNs in column 2, and I couldn't work out a way to deal with them.

(I've also tried a different approach based on .isin, but that didn't work because of AttributeError: 'list' object has no attribute 'isin'.)

I then had the idea of replacing all the NaNs in col_2 with a 0 so that my initial for loop could run through. There are no 0s in col_1 and there never will be, so I'd be fine with that solution because it won't lead to wrong matches in col_3. To this end,

df.loc[df["col_2"].isnull(), "col_2"] = 0

is not enough, because the in check then can't deal with plain ints either: TypeError: argument of type 'int' is not iterable. I would need the 0 to be inserted as an element of a list, but you can't simply write = [0] instead. I've tried different things based on .at, because it should be able to insert lists into cells, but I couldn't work it out.
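(The closest I've got to that idea is a sketch that replaces each NaN cell via apply instead of .at, assuming col_2 only ever holds lists or NaN, though it feels like a workaround rather than a proper solution:)

# replace every non-list cell (the NaNs) with [0] so the membership test always gets an iterable
df["col_2"] = df["col_2"].apply(lambda v: v if isinstance(v, list) else [0])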

Thank you in advance for any advice!

Upvotes: 2

Views: 328

Answers (2)

jezrael

Reputation: 863166

You can filter out the NaNs with an if-else on x['col_2'] == x['col_2'], because NaN never compares equal to itself, so the condition is False only for NaN. map with a dictionary then converts the True values to 1; False values are not in the dict, so NaN is returned for them:

f = lambda x: x['col_1'] in x['col_2'] if x['col_2'] == x['col_2'] else np.nan
df['col_3'] = df.apply(f, 1).map({True:1})
print (df)
   col_1      col_2  col_3
0      1     [1, 2]    1.0
1      2        NaN    NaN
2      3  [3, 5, 9]    1.0
3      4        [2]    NaN
4      5     [8, 5]    1.0

Or use DataFrame.dropna to remove the NaN rows first; when the new column is assigned back, the removed rows get NaN automatically:

f = lambda x: x['col_1'] in x['col_2']
df['col_3'] = df.dropna(subset=['col_1', 'col_2']).apply(f, 1).map({True:1})
print (df)
   col_1      col_2  col_3
0      1     [1, 2]    1.0
1      2        NaN    NaN
2      3  [3, 5, 9]    1.0
3      4        [2]    NaN
4      5     [8, 5]    1.0

Upvotes: 3

BENY

Reputation: 323326

Use the fact that DataFrame.isin with a Series aligns on the index: expand the lists into a wide frame, compare each expanded element against that row's col_1, count the per-row matches, and keep only the non-zero ones:

s = df.dropna(subset=['col_2', 'col_1'])
df['new'] = pd.DataFrame(s.col_2.tolist(), index=s.index).isin(df.col_1).sum(1).loc[lambda x: x != 0]
df
   col_1      col_2  col_3  new
0      1     [1, 2]    NaN  1.0
1      2        NaN    NaN  NaN
2      3  [3, 5, 9]    NaN  1.0
3      4        [2]    NaN  NaN
4      5     [8, 5]    NaN  1.0
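If the result should land in col_3 as asked in the question, rather than in a new column, the same expression can be assigned there; a sketch of that variant:

# assign the per-row match count (filtered to non-zero) back to col_3;
# rows removed by the filter simply stay NaN after index alignment
df['col_3'] = (pd.DataFrame(s.col_2.tolist(), index=s.index)
                 .isin(df.col_1)
                 .sum(1)
                 .loc[lambda x: x != 0])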

Upvotes: 1
