Reputation: 81
I have a dataframe, user_df, with ~500,000 rows with the following format:
| id | other_ids |
|------|--------------|
| 1 |['abc', 'efg'] |
| 2 |['bbb'] |
| 3 |['ccc', 'ddd']|
I also have a list, other_ids_that_clicked, with ~5000 items full of other ids:
['abc', 'efg', 'ccc']
I'm looking to de-dupe other_ids_that_clicked against user_df by adding a column to user_df that is 1 when any value in other_ids_that_clicked appears in that row's other_ids, and 0 otherwise:
| id | other_ids | clicked |
|------|--------------|-----------|
| 1 |['abc', 'efg'] | 1 |
| 2 |['bbb'] | 0 |
| 3 |['ccc', 'ddd']| 1 |
The way I'm checking is by looping through other_ids_that_clicked for each row in user_df.
def otheridInList(row):
    isin = False
    for other_id in other_ids_that_clicked:
        if other_id in row['other_ids']:
            isin = True
            break
        else:
            isin = False
    if isin:
        return 1
    else:
        return 0
This is taking forever, so I was looking for suggestions on best ways to approach this.
Thanks!
Upvotes: 3
Views: 2541
Reputation: 323226
Using set (here l is the list of clicked ids, i.e. other_ids_that_clicked):
df['New']=(df.other_ids.apply(set)!=(df.other_ids.apply(set)-set(l))).astype(int)
df
Out[114]:
id other_ids New
0 1 [abc, efg] 1
1 2 [bbb] 0
2 3 [ccc, ddd] 1
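For reference, the same set idea can be sketched in a self-contained form. This is only an illustration, not the answerer's exact code: it builds the lookup set once and uses set.isdisjoint instead of set subtraction. The sample frame and the name clicked_set are assumptions made up for the example.

```python
import pandas as pd

# Sample data mirroring the question (assumed for illustration).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "other_ids": [["abc", "efg"], ["bbb"], ["ccc", "ddd"]],
})
l = ["abc", "efg", "ccc"]

# Build the lookup set once so each membership test is O(1) on average.
clicked_set = set(l)

# A row is flagged 1 when its list shares at least one id with clicked_set.
df["New"] = [int(not clicked_set.isdisjoint(ids)) for ids in df["other_ids"]]

print(df)
```

Building the set once avoids rescanning the ~5000-item list for every row, which is most likely where the original loop spends its time.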
Upvotes: 3
Reputation: 402483
You can actually speed this up quite a bit. Take out the column, convert it into its own dataframe, and use df.isin to do some checking:
l = ['abc', 'efg', 'ccc']
df['clicked'] = pd.DataFrame(df.other_ids.tolist()).isin(l).any(axis=1).astype(int)
id other_ids clicked
0 1 [abc, efg] 1
1 2 [bbb] 0
2 3 [ccc, ddd] 1
Details
First, convert other_ids into a list of lists:
i = df.other_ids.tolist()
i
[['abc', 'efg'], ['bbb'], ['ccc', 'ddd']]
Now, load it into a new dataframe -
j = pd.DataFrame(i)
j
0 1
0 abc efg
1 bbb None
2 ccc ddd
Perform checks with isin:
k = j.isin(l)
k
0 1
0 True True
1 False False
2 True False
clicked can be computed by checking if True is present in any row, with df.any. The result is converted into an integer.
k.any(axis=1).astype(int)
0 1
1 0
2 1
dtype: int64
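Putting the steps together, a minimal end-to-end sketch might look like the following. It uses the question's names (user_df, other_ids_that_clicked); the intermediate name wide is an assumption introduced here. It passes axis by keyword, since pandas 2.x no longer accepts it positionally in DataFrame.any, and it reuses the original index so the result aligns even if user_df does not have a default RangeIndex.

```python
import pandas as pd

# Sample data mirroring the question (assumed for illustration).
user_df = pd.DataFrame({
    "id": [1, 2, 3],
    "other_ids": [["abc", "efg"], ["bbb"], ["ccc", "ddd"]],
})
other_ids_that_clicked = ["abc", "efg", "ccc"]

# Spread each list across columns; shorter lists are padded with None.
wide = pd.DataFrame(user_df["other_ids"].tolist(), index=user_df.index)

# Element-wise membership test, then reduce across each row.
user_df["clicked"] = wide.isin(other_ids_that_clicked).any(axis=1).astype(int)

print(user_df)
```

The None padding is harmless here because None is not in other_ids_that_clicked, so padded cells always evaluate to False.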
Upvotes: 6