Reputation: 67
Let's take this dataframe that has a column of arrays:
In: df = pd.DataFrame([['one', np.array([1,2,3,4])],
['two', np.array([1,3])],
['three', np.array([0,2,4])]],
columns=['id', 'items'])
Out:
id items
0 one [1, 2, 3, 4]
1 two [1, 3]
2 three [0, 2, 4]
If I want to filter by an element being in 'items' I would do:
In: df[ df['items'].apply(lambda x: 2 in x)]
Out:
id items
0 one [1, 2, 3, 4]
2 three [0, 2, 4]
However, this method is extremely slow and my dataframe is very large. Is there any faster way to iterate through the elements in 'items'?
Upvotes: 3
Views: 894
Reputation: 323226
IIUC
m = pd.DataFrame(df['items'].tolist()).isin([2]).any(axis=1)
Out[70]:
0 True
1 False
2 True
dtype: bool
df1 = df[m].copy()
Alternatively, try a plain list comprehension:
[2 in x for x in df['items']]
Out[81]: [True, False, True]
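For reference, here is a self-contained sketch of both masks from this answer, assuming pandas and NumPy are installed. The names `mask_frame`, `mask_list` and `result` are only illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# Expand the arrays into a padded frame (shorter rows are filled with
# missing values), then test each row for the value 2.
mask_frame = pd.DataFrame(df['items'].tolist()).isin([2]).any(axis=1)

# Plain list comprehension: one membership test per row.
mask_list = [2 in x for x in df['items']]

# Either mask can be used for boolean indexing.
result = df[mask_list]
print(result)
```

Note that the axis is passed by keyword (`axis=1`); newer pandas releases no longer accept it positionally.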
Upvotes: 1
Reputation: 88236
Using sets, you can check whether a given number (2 here) is contained in each of the lists with set.issubset:
df[df['items'].agg({2}.issubset)]
id items
0 one [1, 2, 3, 4]
2 three [0, 2, 4]
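A minimal runnable sketch of the set-based filter, assuming pandas and NumPy are installed. It uses `Series.apply` instead of `agg` as a precaution, since recent pandas versions appear to warn when `agg` receives a callable that works element by element; the names `mask_set`, `filtered` and `mask_both` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same sample frame as in the question.
df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# {2}.issubset(x) is True when every element of {2} appears in x,
# i.e. when 2 is one of the items in that row.
mask_set = df['items'].apply({2}.issubset)
filtered = df[mask_set]
print(filtered)

# The same idea extends to several required values at once:
# keep rows whose items contain both 2 and 4.
mask_both = df['items'].apply({2, 4}.issubset)
```

The bound method `{2}.issubset` accepts any iterable, so it works on NumPy arrays without converting each row to a set first.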
Timings on a large dataframe:
df_large = pd.concat([df]*100_000, axis=0, ignore_index=True)
%timeit df_large[df_large['items'].agg({2}.issubset)]
# 355 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.DataFrame(df_large['items'].tolist()).isin([2]).any(axis=1)
# 564 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# timed with .any(level=0), which newer pandas replaces with .groupby(level=0).any()
%timeit df_large[df_large['items'].explode().eq(2).groupby(level=0).any()]
# 658 ms ± 6.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
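The `%timeit` lines above need IPython. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The `candidates` dict, its labels, and the smaller repeat count (1,000 instead of 100,000) are assumptions made here to keep the example quick; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# Smaller than the frame timed above so the sketch finishes quickly.
df_large = pd.concat([df] * 1_000, axis=0, ignore_index=True)

# Each candidate returns the filtered frame (hypothetical labels).
candidates = {
    'set.issubset': lambda: df_large[
        df_large['items'].apply({2}.issubset)],
    'padded frame + isin': lambda: df_large[
        pd.DataFrame(df_large['items'].tolist()).isin([2]).any(axis=1)],
    'explode + groupby': lambda: df_large[
        df_large['items'].explode().eq(2).groupby(level=0).any()],
    'list comprehension': lambda: df_large[
        [2 in x for x in df_large['items']]],
}

# Run each candidate once to confirm they all select the same rows.
results = {name: fn() for name, fn in candidates.items()}

# Best of three single runs per candidate, in seconds.
timings = {name: min(timeit.repeat(fn, number=1, repeat=3))
           for name, fn in candidates.items()}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f'{name:22s} {seconds * 1000:8.2f} ms')
```

Checking that every candidate returns the same rows before timing helps ensure the comparison is between equivalent filters.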
Upvotes: 4