Reputation: 2501
I have a Pandas dataframe with 4 columns - like below:
A B C D
2 c {4889, 9978, 1230, 4921} {30} 4
4 m {4889, 9978, 1230, 4921} {30} 4
0 a {4889, 1230, 4921} {30} 3
7 q {1240, 4921} {30} 2
9 x {9978, 1230} {30} 2
Also, I have a list like this:
[[1230,4889],[1240, 4921]]
I want to select those rows from the dataframe where the column B values are supersets of any of the list items. For the given example, the output would be:
A B C D
2 c {4889, 9978, 1230, 4921} {30} 4
4 m {4889, 9978, 1230, 4921} {30} 4
0 a {4889, 1230, 4921} {30} 3
7 q {1240, 4921} {30} 2
any nice way to do it? it is not as straight-forward as doing something like:
df.loc[df['B'] == 'xyz']
Upvotes: 1
Views: 54
Reputation: 294488
Use numpy
broadcasting with set operations.
Note: >=
for sets returns the truth value of wether the right side is a subset of the left side. The equality portion allows for equal sets.
s = np.array([set(l) for l in [[1230, 4889], [1240, 4921]]])
m = (df['B'].values >= s[:, None]).any(0)
df[m]
A B C D
2 c {4889, 9978, 1230, 4921} {30} 4
4 m {4889, 9978, 1230, 4921} {30} 4
0 a {4889, 1230, 4921} {30} 3
7 q {1240, 4921} {30} 2
Upvotes: 2
Reputation: 11602
A naive approach, simply checking for containment of each of the elements of your list.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(""" A B C D
2 c {4889, 9978, 1230, 4921} {30} 4
4 m {4889, 9978, 1230, 4921} {30} 4
0 a {4889, 1230, 4921} {30} 3
7 q {1240, 4921} {30} 2
9 x {9978, 1230} {30} 2""",
), sep=r' +')
df['B'] = df.B.apply(eval)
lst = [[1230,4889],[1240, 4921]]
lst_sets = [set(l) for l in lst]
df_res = df[df.B.apply(lambda s: any(not(s2 - s) for s2 in lst_sets))]
Upvotes: 0