user8766186

Reputation: 81

Python: Efficiently check if value in a list is in another list

I have a dataframe, user_df, with ~500,000 rows with the following format:

|  id  |  other_ids   |
|------|--------------|
|  1   |['abc', 'efg']|
|  2   |['bbb']       |
|  3   |['ccc', 'ddd']|

I also have a list, other_ids_that_clicked, with ~5000 items full of other ids:

 ['abc', 'efg', 'ccc']

I'm looking to de-dupe other_ids_that_clicked using user_df by adding another column to user_df that is 1 when any value in a row's other_ids is in other_ids_that_clicked, and 0 otherwise, as such:

|  id  |  other_ids   |  clicked  |
|------|--------------|-----------|
|  1   |['abc', 'efg']|     1     |
|  2   |['bbb']       |     0     |
|  3   |['ccc', 'ddd']|     1     |

The way I'm checking is by looping through other_ids_that_clicked for each row in user_df.

def otheridInList(row):
  # Return 1 as soon as any clicked id is found in this row's other_ids, else 0
  for other_id in other_ids_that_clicked:
    if other_id in row['other_ids']:
      return 1
  return 0

This is taking forever, so I was looking for suggestions on best ways to approach this.

Thanks!

Upvotes: 3

Views: 2541

Answers (2)

BENY

Reputation: 323226

Using set

l = ['abc', 'efg', 'ccc']  # the list of clicked ids (other_ids_that_clicked)
df['New']=(df.other_ids.apply(set)!=(df.other_ids.apply(set)-set(l))).astype(int)
df
Out[114]: 
   id   other_ids  New
0   1  [abc, efg]    1
1   2       [bbb]    0
2   3  [ccc, ddd]    1
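
The same set idea can also be sketched with set.isdisjoint, which stops at the first shared element instead of building a difference set for every row. This is only a minimal illustration in plain Python: the names clicked_ids, rows and has_clicked are hypothetical, and rows stands in for the other_ids column.

```python
# Hypothetical names for illustration: clicked_ids, rows, has_clicked.
clicked_ids = {'abc', 'efg', 'ccc'}  # lookup set, built once from the clicked list
rows = [['abc', 'efg'], ['bbb'], ['ccc', 'ddd']]  # stands in for the other_ids column

def has_clicked(ids, clicked):
    """Return 1 if ids shares at least one element with the set clicked, else 0."""
    return int(not clicked.isdisjoint(ids))

flags = [has_clicked(ids, clicked_ids) for ids in rows]
print(flags)
```

With a dataframe, the function could presumably be applied as df.other_ids.apply(lambda ids: has_clicked(ids, clicked_ids)).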

Upvotes: 3

cs95

Reputation: 402483

You can actually speed this up quite a bit. Take out the column, convert it into its own dataframe, and use df.isin to do some checking -

l = ['abc', 'efg', 'ccc']
df['clicked'] = pd.DataFrame(df.other_ids.tolist()).isin(l).any(axis=1).astype(int)

   id   other_ids  clicked
0   1  [abc, efg]        1
1   2       [bbb]        0
2   3  [ccc, ddd]        1

Details

First, convert other_ids into a list of lists -

i = df.other_ids.tolist()

i
[['abc', 'efg'], ['bbb'], ['ccc', 'ddd']]

Now, load it into a new dataframe -

j = pd.DataFrame(i)

j
     0     1
0  abc   efg
1  bbb  None
2  ccc   ddd

Perform checks with isin -

k = j.isin(l)

k
       0      1
0   True   True
1  False  False
2   True  False

clicked can be computed by checking whether True is present anywhere in each row, with df.any along axis=1. The result is converted into an integer.

k.any(axis=1).astype(int)

0    1
1    0
2    1
dtype: int64
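
Putting the steps above together, a self-contained sketch might look like the following. The sample frame is copied from the question, the name wide is an illustrative choice, and passing index=df.index is an added assumption so the result still aligns if the frame does not have a default index.

```python
import pandas as pd

# Sample data copied from the question; the real frame has ~500,000 rows.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'other_ids': [['abc', 'efg'], ['bbb'], ['ccc', 'ddd']],
})
l = ['abc', 'efg', 'ccc']

# Expand the list column into a wide frame; shorter lists are padded with None.
# 'wide' is a hypothetical name used only for this sketch.
wide = pd.DataFrame(df['other_ids'].tolist(), index=df.index)

# Elementwise membership test, then reduce across each row.
df['clicked'] = wide.isin(l).any(axis=1).astype(int)
print(df)
```

Note that the wide frame has as many columns as the longest list in other_ids, so memory use grows with the longest row rather than the average one.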

Upvotes: 6
