pandas row selection based on column subsets

Question

I have a pandas DataFrame with 4 columns, like below:

          A                       B            C           D
2         c  {4889, 9978, 1230, 4921}        {30}         4
4         m  {4889, 9978, 1230, 4921}        {30}         4
0         a        {4889, 1230, 4921}        {30}         3
7         q              {1240, 4921}        {30}         2
9         x              {9978, 1230}        {30}         2

Also, I have a list like this:

[[1230,4889],[1240, 4921]]

I want to select the rows where the set in column B is a superset of at least one of the list items. For the given example, the output would be:

          A                       B            C           D
2         c  {4889, 9978, 1230, 4921}        {30}         4
4         m  {4889, 9978, 1230, 4921}        {30}         4
0         a        {4889, 1230, 4921}        {30}         3
7         q              {1240, 4921}        {30}         2

Is there a nice way to do it? It is not as straightforward as doing something like:

df.loc[df['B'] == 'xyz']

piRSquared · Accepted Answer

Use NumPy broadcasting with set operations. Note: for sets, >= tests whether the right-hand side is a subset of the left-hand side. The equality part of the operator means equal sets also match.
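To make the operator semantics concrete, here is a small sketch using sets taken from the question. The variable names are illustrative only and are not part of the answer's code.

```python
# Two sets from column B in the question, and one candidate from the list.
row_a = {4889, 1230, 4921}
row_x = {9978, 1230}
candidate = {1230, 4889}

# `>=` on sets is the superset test; it is equivalent to set.issuperset.
a_matches = row_a >= candidate            # every element of candidate is in row_a
x_matches = row_x >= candidate            # 4889 is missing from row_x
equal_matches = candidate >= candidate    # equal sets count as supersets
same_as_method = row_a.issuperset(candidate)

print(a_matches, x_matches, equal_matches, same_as_method)
```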

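import numpy as np

# Object array holding one set per list item.
s = np.array([set(l) for l in [[1230, 4889], [1240, 4921]]])

# Compare every value in B against every set, then keep rows matching any of them.
m = (df['B'].values >= s[:, None]).any(0)

df[m]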

   A                         B     C  D
2  c  {4889, 9978, 1230, 4921}  {30}  4
4  m  {4889, 9978, 1230, 4921}  {30}  4
0  a        {4889, 1230, 4921}  {30}  3
7  q              {1240, 4921}  {30}  2
