Reputation: 67

Faster way to do conditional slicing on a Pandas dataframe containing a column of arrays

Let's take this dataframe that has a column of arrays:

In:  df = pd.DataFrame([['one', np.array([1,2,3,4])], 
                        ['two', np.array([1,3])], 
                        ['three', np.array([0,2,4])]],
                       columns=['id', 'items'])

Out:
      id         items
0    one  [1, 2, 3, 4]
1    two        [1, 3]
2  three     [0, 2, 4]

If I want to keep only the rows whose 'items' array contains a given element, I would do:

In: df[df['items'].apply(lambda x: 2 in x)]

Out:
      id         items
0    one  [1, 2, 3, 4]
2  three     [0, 2, 4]

However, this method is extremely slow and my dataframe is very large. Is there a faster way to test membership across the arrays in 'items'?

Upvotes: 3

Answers (3)

BENY

Reputation: 323226

IIUC

m = pd.DataFrame(df['items'].tolist()).isin([2]).any(axis=1)
Out[70]: 
0     True
1    False
2     True
dtype: bool
df1 = df[m].copy()

Alternatively, a list comprehension builds the same mask:

[2 in x for x in df['items']]
Out[81]: [True, False, True]
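
To make the last step explicit, here is a minimal sketch of using that comprehension as a boolean mask. The names mask and filtered are only illustrative, and wrapping the list in np.array with dtype=bool is my own choice so the mask has an unambiguous boolean dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# One membership test per row, collected into a boolean NumPy array.
mask = np.array([2 in x for x in df['items']], dtype=bool)

# Boolean indexing keeps the rows where the mask is True.
filtered = df[mask]
print(filtered)
```

The mask has one entry per row, in row order, so it can be passed straight to df[...] or df.loc[...].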

Upvotes: 1

yatu

Reputation: 88236

Using sets, you can check whether the set containing the given number ({2} here) is a subset of each array, via set.issubset:

df[df['items'].agg({2}.issubset)]

     id         items
0    one  [1, 2, 3, 4]
2  three     [0, 2, 4]
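
A hedged variant of the same idea: as far as I know, some newer pandas releases warn when agg is given a function that does not actually aggregate. If that happens, Series.map is an explicitly elementwise call that should give the same mask. The names mask and result below are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# {2}.issubset(arr) is True when every member of {2} occurs in arr.
# Series.map calls it once per row, producing a boolean Series.
mask = df['items'].map({2}.issubset)

result = df[mask]
print(result)
```

To require several values at once, put them all in the set, for example {2, 4}.issubset.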

Timings on a large dataframe:

df_large = pd.concat([df]*100_000, axis=0, ignore_index=True)

%timeit df_large[df_large['items'].agg({2}.issubset)]
# 355 ms ± 3.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit  pd.DataFrame(df_large['items'].tolist()).isin([2]).any(1)
# 564 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_large[df_large['items'].explode().eq(2).any(level=0)]
# 658 ms ± 6.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
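
%timeit is an IPython magic. Outside IPython, a rough equivalent can be sketched with the standard timeit module. This is only an illustration: the function names with_sets and with_comprehension are made up, the frame is smaller than the one above, and the numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# Smaller than the 100_000 copies used above, to keep the sketch quick.
df_large = pd.concat([df] * 10_000, axis=0, ignore_index=True)


def with_sets():
    # Set-based membership test, applied elementwise with Series.map.
    return df_large[df_large['items'].map({2}.issubset)]


def with_comprehension():
    # Plain list comprehension converted to a boolean mask.
    mask = np.array([2 in x for x in df_large['items']], dtype=bool)
    return df_large[mask]


for fn in (with_sets, with_comprehension):
    # Best of three single runs, in seconds.
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s")
```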

Upvotes: 4

anky

Reputation: 75080

You can try Series.explode (new in pandas 0.25.0) combined with any:

df[df['items'].explode().eq(2).any(level=0)]

      id         items
0    one  [1, 2, 3, 4]
2  three     [0, 2, 4]
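
The level argument of any was deprecated in later pandas releases and removed in pandas 2.0. On those versions, a sketch of the same idea groups on the index level instead. It assumes the dataframe index is unique, and the names exploded, mask and result are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', np.array([1, 2, 3, 4])],
                   ['two', np.array([1, 3])],
                   ['three', np.array([0, 2, 4])]],
                  columns=['id', 'items'])

# One row per array element; the original row label is repeated in the index.
exploded = df['items'].explode()

# Compare each element to 2, then reduce back to one boolean per original row.
mask = exploded.eq(2).groupby(level=0).any()

result = df[mask]
print(result)
```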

Upvotes: 2
