user1717931
user1717931

Reputation: 2501

pandas row selection based on column subsets

I have a Pandas dataframe with 4 columns - like below:

          A                       B            C           D
2         c  {4889, 9978, 1230, 4921}        {30}         4
4         m  {4889, 9978, 1230, 4921}        {30}         4
0         a        {4889, 1230, 4921}        {30}         3
7         q              {1240, 4921}        {30}         2
9         x              {9978, 1230}        {30}         2

Also, I have a list like this:

[[1230,4889],[1240, 4921]]

I want to select those rows from the dataframe where the column B values are supersets of any of the list items. For the given example, the output would be:

          A                       B            C           D
2         c  {4889, 9978, 1230, 4921}        {30}         4
4         m  {4889, 9978, 1230, 4921}        {30}         4
0         a        {4889, 1230, 4921}        {30}         3
7         q              {1240, 4921}        {30}         2

any nice way to do it? it is not as straight-forward as doing something like:

df.loc[df['B'] == 'xyz']

Upvotes: 1

Views: 54

Answers (2)

piRSquared
piRSquared

Reputation: 294488

Use numpy broadcasting with set operations. Note: >= for sets returns the truth value of wether the right side is a subset of the left side. The equality portion allows for equal sets.


s = np.array([set(l) for l in [[1230, 4889], [1240, 4921]]])

m = (df['B'].values >= s[:, None]).any(0)

df[m]

   A                         B     C  D
2  c  {4889, 9978, 1230, 4921}  {30}  4
4  m  {4889, 9978, 1230, 4921}  {30}  4
0  a        {4889, 1230, 4921}  {30}  3
7  q              {1240, 4921}  {30}  2

Upvotes: 2

hilberts_drinking_problem
hilberts_drinking_problem

Reputation: 11602

A naive approach, simply checking for containment of each of the elements of your list.

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("""          A                       B            C           D
    2         c  {4889, 9978, 1230, 4921}        {30}         4
    4         m  {4889, 9978, 1230, 4921}        {30}         4
    0         a        {4889, 1230, 4921}        {30}         3
    7         q              {1240, 4921}        {30}         2
    9         x              {9978, 1230}        {30}         2""",
    ), sep=r'  +')

df['B'] = df.B.apply(eval)

lst = [[1230,4889],[1240, 4921]]
lst_sets = [set(l) for l in lst]

df_res = df[df.B.apply(lambda s: any(not(s2 - s) for s2 in lst_sets))]

Upvotes: 0

Related Questions