Reputation: 2670
Say I have the following dataframe (a column of integers and a column with a list of integers)...
ID Found_IDs
0 12345 [15443, 15533, 3433]
1 15533 [2234, 16608, 12002, 7654]
2 6789 [43322, 876544, 36789]
And also a separate list of IDs...
bad_ids = [15533, 876544, 36789, 11111]
Given that, and ignoring the df['ID'] column and any index, I want to see if any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is:
df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
This works, but only if the bad_ids list is at least as long as the dataframe; for the real dataset the bad_ids list is going to be a lot shorter than the dataframe. If I set the bad_ids list to only two elements...
bad_ids = [15533, 876544]
I get a very popular error (I have read many questions with the same error)...
ValueError: Length of values does not match length of index
I have tried converting the list to a series (no change in the error). I have also tried adding the new column and setting all values to False before doing the comprehension line (again, no change in the error).
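For what it's worth, I think I can see why it breaks (a minimal check using the df and bad_ids above): zip stops at the shorter of its two inputs, so with two bad IDs the comprehension only produces two values for a three-row dataframe, and it only pairs each bad ID with a single row rather than checking every bad ID against every row.
pairs = list(zip(bad_ids, df['Found_IDs']))  # zip truncates to the shorter input
print(len(pairs), len(df))  # 2 vs 3 - assigning a 2-element list to a 3-row frame raises the ValueError above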
Two questions:
1. How do I get the code to work when the bad_ids list is shorter than the dataframe?
2. How would I get the matching ID(s) written to the df['bad_id'] column (more useful than True/False)?
Expected output for bad_ids = [15533, 876544]:
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] True
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] True
Ideal output for bad_ids = [15533, 876544] (ID(s) are written to a new column or columns):
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] 15533
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] 876544
Code:
import pandas as pd
result_list = [[12345, [15443, 15533, 3433]],
               [15533, [2234, 16608, 12002, 7654]],
               [6789, [43322, 876544, 36789]]]
df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])
# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]
# fails if list has two elements (fewer elements than the dataframe)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]
# converting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))
# setting up a new column of false values doesn't change things
# df['bad_id'] = False
print(df)
df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]
print(bad_ids)
print(df)
Upvotes: 12
Views: 3002
Reputation: 10970
Use explode and a groupby aggregation:
s = df['Found_IDs'].explode()
df['bad_ids'] = s.isin(bad_ids).groupby(s.index).any()
For bad_ids = [15533, 876544]
>>> df
ID Found_IDs bad_ids
0 12345 [15443, 15533, 3433] True
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] True
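To see why this lines up (a small sketch): explode repeats the original index once per list element, so grouping the isin result by s.index brings it back to one value per row of df. The intermediate series looks roughly like:
print(s.head())
# 0    15443
# 0    15533
# 0     3433
# 1     2234
# 1    16608
# Name: Found_IDs, dtype: object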
Or, to get the matching values themselves:
s = df['Found_IDs'].explode()
df['bad_ids'] = s.where(s.isin(bad_ids)).groupby(s.index).agg(lambda x: list(x.dropna()))
For bad_ids = [15533, 876544]
ID Found_IDs bad_ids
0 12345 [15443, 15533, 3433] [15533]
1 15533 [2234, 16608, 12002, 7654] []
2 6789 [43322, 876544, 36789] [876544]
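If the ideal output from the question (a single ID, or False when nothing matches) is wanted, one possible follow-up on the column built above is:
df['bad_id'] = df['bad_ids'].map(lambda m: m[0] if m else False)  # first match, or False for an empty list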
Upvotes: 0
Reputation: 863651
If you want to test all the values of the lists in the Found_IDs column against all the values of bad_ids, use:
bad_ids = [15533, 876544]
df['bad_id'] = [any(c in l for c in bad_ids) for l in df['Found_IDs']]
print (df)
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] True
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] True
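As a usage example (not strictly part of the question), the boolean column also works directly as a mask for pulling out the rows that contain a bad ID:
print(df[df['bad_id']])  # keeps rows 0 and 2, the ones with at least one bad ID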
If you want all matches:
df['bad_id'] = [[c for c in bad_ids if c in l] for l in df['Found_IDs']]
print (df)
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] [15533]
1 15533 [2234, 16608, 12002, 7654] []
2 6789 [43322, 876544, 36789] [876544]
And for the first match, with False where there is no match (a possible solution, though mixing booleans and numbers is not recommended):
df['bad_id'] = [next(iter([c for c in bad_ids if c in l]), False) for l in df['Found_IDs']]
print (df)
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] 15533
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] 876544
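If mixing booleans and numbers is a concern, one possible variation (a sketch, assuming pandas >= 1.0 for the nullable Int64 dtype) keeps the column numeric and uses <NA> instead of False:
df['bad_id'] = pd.array(
    [next((c for c in bad_ids if c in l), None) for l in df['Found_IDs']],
    dtype='Int64',  # nullable integers: the first matching ID, or <NA> when there is no match
)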
Solution with sets:
df['bad_id'] = df['Found_IDs'].map(set(bad_ids).intersection)
print (df)
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] {15533}
1 15533 [2234, 16608, 12002, 7654] {}
2 6789 [43322, 876544, 36789] {876544}
And similarly with a list comprehension:
df['bad_id'] = [list(set(bad_ids).intersection(l)) for l in df['Found_IDs']]
print (df)
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] [15533]
1 15533 [2234, 16608, 12002, 7654] []
2 6789 [43322, 876544, 36789] [876544]
Upvotes: 3
Reputation: 23099
Using merge and concat while grouping by your index to return all the matches:
bad_ids = [15533, 876544, 36789, 11111]
df2 = pd.concat(
    [
        df,
        pd.merge(
            df["Found_IDs"].explode().reset_index(),
            pd.Series(bad_ids, name="bad_ids"),
            left_on="Found_IDs",
            right_on="bad_ids",
            how="inner",
        )
        .groupby("index")
        .agg(bad_ids=("bad_ids", list)),
    ],
    axis=1,
).fillna(False)
print(df2)
ID Found_IDs bad_ids
0 12345 [15443, 15533, 3433] [15533]
1 15533 [2234, 16608, 12002, 7654] False
2 6789 [43322, 876544, 36789] [876544, 36789]
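For illustration (a sketch, reusing df and the four-element bad_ids), the frame that feeds the groupby is the inner merge of the exploded Found_IDs against bad_ids, one row per matching value and keyed by the original index:
inner = pd.merge(
    df["Found_IDs"].explode().reset_index(),  # columns: index, Found_IDs
    pd.Series(bad_ids, name="bad_ids"),
    left_on="Found_IDs",
    right_on="bad_ids",
    how="inner",
)
print(inner)
# roughly:
#    index Found_IDs  bad_ids
# 0      0     15533    15533
# 1      2    876544   876544
# 2      2     36789    36789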
Upvotes: 1
Reputation: 42946
Using np.intersect1d to get the intersection of the two lists:
import numpy as np
df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))
ID Found_IDs bad_id
0 12345 [15443, 15533, 3433] [15533]
1 15533 [2234, 16608, 12002, 7654] []
2 6789 [43322, 876544, 36789] [876544]
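Note that np.intersect1d returns a sorted NumPy array of the common values; if plain Python lists are preferred in the column, a small variation is:
df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids).tolist())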
Or with just vanilla Python, using the intersection of sets:
bad_ids_set = set(bad_ids)
df['bad_id'] = df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))
Upvotes: 9
Reputation: 4618
You can use apply together with np.any:
import numpy as np
df['bad_id'] = df['Found_IDs'].apply(lambda x: np.any([c in x for c in bad_ids]))
This returns a bool indicating whether any bad_id exists in Found_IDs. If you want to retrieve those bad_ids instead:
df['bad_id'] = df['Found_IDs'].apply(lambda x: [*filter(lambda c: c in x, bad_ids)])
This returns a list of the bad_ids found in Found_IDs; if there are none, it returns [].
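A quick check with the two-element bad_ids from the question:
print(df['bad_id'].tolist())
# expected: [[15533], [], [876544]]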
Upvotes: 1