MDR
MDR

Reputation: 2670

Add ID found in list to new column in pandas dataframe

Say I have the following dataframe (a column of integers and a column with a list of integers)...

      ID                   Found_IDs
0  12345        [15443, 15533, 3433]
1  15533  [2234, 16608, 12002, 7654]
2   6789      [43322, 876544, 36789]

And also a separate list of IDs...

bad_ids = [15533, 876544, 36789, 11111]

Given that, and ignoring the df['ID'] column and any index, I want to see if any of the IDs in the bad_ids list are mentioned in the df['Found_IDs'] column. The code I have so far is:

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

This works but only if the bad_ids list is longer than the dataframe and for the real dataset the bad_ids list is going to be a lot shorter than the dataframe. If I set the bad_ids list to only two elements...

bad_ids = [15533, 876544]

I get a very popular error (I have read many questions with the same error)...

ValueError: Length of values does not match length of index

I have tried converting the list to a series (no change in the error). I have also tried adding the new column and setting all values to False before doing the comprehension line (again no change in the error).

Two questions:

  1. How do I get my code (below) to work for a list that is shorter than a dataframe?
  2. How would I get the code to write the actual ID found back to the df['bad_id'] column (more useful than True/False)?

Expected output for bad_ids = [15533, 876544]:

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

Ideal output for bad_ids = [15533, 876544] (ID(s) are written to a new column or columns):

      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    876544

Code:

import pandas as pd

result_list = [[12345,[15443,15533,3433]],
        [15533,[2234,16608,12002,7654]],
        [6789,[43322,876544,36789]]]

df = pd.DataFrame(result_list,columns=['ID','Found_IDs'])

# works if list has four elements
# bad_ids = [15533, 876544, 36789, 11111]

# fails if list has two elements (less elements than the dataframe)
# ValueError: Length of values does not match length of index
bad_ids = [15533, 876544]

# coverting to Series doesn't change things
# bad_ids = pd.Series(bad_ids)
# print(type(bad_ids))

# setting up a new column of false values doesn't change things
# df['bad_id'] = False

print(df)

df['bad_id'] = [c in l for c, l in zip(bad_ids, df['Found_IDs'])]

print(bad_ids)

print(df)

Upvotes: 12

Views: 3002

Answers (5)

Vishnudev Krishnadas
Vishnudev Krishnadas

Reputation: 10970

Use explode and groupby aggregate

s = df['Found_IDs'].explode()
df['bad_ids'] = s.isin(bad_ids).groupby(s.index).any()

For bad_ids = [15533, 876544]

>>> df
      ID                   Found_IDs  bad_ids
0  12345        [15443, 15533, 3433]     True
1  15533  [2234, 16608, 12002, 7654]    False
2   6789      [43322, 876544, 36789]     True

OR

For getting values matching

s = df['Found_IDs'].explode()
s.where(s.isin(bad_ids)).groupby(s.index).agg(lambda x: list(x.dropna()))

For bad_ids = [15533, 876544]

      ID                   Found_IDs   bad_ids
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

Upvotes: 0

jezrael
jezrael

Reputation: 863651

If want test all values of lists in Found_IDs column by all values of bad_ids use:

bad_ids = [15533, 876544]

df['bad_id'] = [any(c in l for c in bad_ids) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]    True
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]    True

If want all match:

df['bad_id'] = [[c for c in bad_ids if c in l] for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

And for first match, if empty list is set False, possible solution, but not recommended mixing boolean and numbers:

df['bad_id'] = [next(iter([c for c in bad_ids if c in l]), False) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs  bad_id
0  12345        [15443, 15533, 3433]   15533
1  15533  [2234, 16608, 12002, 7654]   False
2   6789      [43322, 876544, 36789]  876544

Solution with sets:

df['bad_id'] = df['Found_IDs'].map(set(bad_ids).intersection)
print (df)

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   {15533}
1  15533  [2234, 16608, 12002, 7654]        {}
2   6789      [43322, 876544, 36789]  {876544}

And also similar with list comprehension:

df['bad_id'] = [list(set(bad_ids).intersection(l)) for l  in df['Found_IDs']]
print (df)
      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

Upvotes: 3

Umar.H
Umar.H

Reputation: 23099

using merge and concat whilst grouping by your index to return all the matches.

bad_ids = [15533, 876544, 36789, 11111]

df2 = pd.concat(
    [
        df,
        pd.merge(
            df["Found_IDs"].explode().reset_index(),
            pd.Series(bad_ids, name="bad_ids"),
            left_on="Found_IDs",
            right_on="bad_ids",
            how="inner",
        )
        .groupby("index")
        .agg(bad_ids=("bad_ids", list)),
    ],
    axis=1,
).fillna(False)
print(df2)


      ID                   Found_IDs          bad_ids
0  12345        [15443, 15533, 3433]          [15533]
1  15533  [2234, 16608, 12002, 7654]            False
2   6789      [43322, 876544, 36789]  [876544, 36789]

Upvotes: 1

Erfan
Erfan

Reputation: 42946

Using np.intersect1d to get the intersect of the two lists:

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.intersect1d(x, bad_ids))

      ID                   Found_IDs    bad_id
0  12345        [15443, 15533, 3433]   [15533]
1  15533  [2234, 16608, 12002, 7654]        []
2   6789      [43322, 876544, 36789]  [876544]

Or with just vanilla python using intersect of sets:

bad_ids_set = set(bad_ids)
df['Found_IDs'].apply(lambda x: list(set(x) & bad_ids_set))

Upvotes: 9

Bruno Mello
Bruno Mello

Reputation: 4618

You can apply and use np.any:

df['bad_id'] = df['Found_IDs'].apply(lambda x: np.any([c in x for c in bad_ids]))

This return the bool if exist a bad_id in Found_IDs, if you want to retrieve this bad_ids:

df['bad_id'] = df['Found_IDs'].apply(lambda x: [*filter(lambda x: c in x, bad_ids)])

This will return a list of the bad_ids at found_ids, if there is 0 it returns []

Upvotes: 1

Related Questions