Mahamutha M
Mahamutha M

Reputation: 1287

Pandas : filter the rows based on a column containing lists

How to filter the rows in a data frame based on another column value?

I have a data frame which is,

ip_df:
     class    name     marks          min_marks  min_subjects
0    I        tom      [89,85,80,74]  80         2
1    II       sam      [65,72,43,40]  85         1

Based on the column values of "min_subject" and "min_marks", the row should be filtered.

The final outcome should be,

op_df:
     class    name     marks          min_marks  min_subjects flag
0    I        tom      [89,85,80,74]  80         2            1
1    II       sam      [65,72,43,40]  85         1            0

Can anyone help me to achieve the same in the data frame?

Upvotes: 5

Views: 608

Answers (2)

Mykola Zotko
Mykola Zotko

Reputation: 17794

To avoid the for loop and make full use of parallel computations you can use the new function explode (Pandas 0.25.0):

df1 = df.explode('marks')
print(df1)

Output:

  class name marks  min_marks  min_subjects
0     I  tom    89         80             2
0     I  tom    85         80             2
0     I  tom    80         80             2
0     I  tom    74         80             2
1    II  sam    65         85             1
1    II  sam    72         85             1
1    II  sam    43         85             1
1    II  sam    40         85             1

Compare the columns marks and min_marks:

df['flag'] = df1['marks'].gt(df1['min_marks'])\
.groupby(df1.index).sum().ge(df['min_subjects']).astype(int)

print(df)

Output:

  class name             marks  min_marks  min_subjects  flag
0     I  tom  [89, 85, 80, 74]         80             2     1
1    II  sam  [65, 72, 43, 40]         85             1     0

Upvotes: 1

jezrael
jezrael

Reputation: 862511

Use list comprehension with zip by 3 columns, compare each value in generator and sum for count, last compare by minimal marks and convert to integers:

df['flag'] = [1 if sum(x > c for x in a) >= b else 0 
                 for a, b, c in zip(df['marks'], df['min_subjects'], df['min_marks'])]

Alternative with convert boolean by int to 0,1:

df['flag'] = [int(sum(x > c for x in a) >= b)
                 for a, b, c in zip(df['marks'], df['min_subjects'], df['min_marks'])]

Or solution with numpy:

df['flag'] = [int(np.sum(np.array(a) > c) >= b)
                  for a, b, c in zip(df['marks'], df['min_subjects'], df['min_marks'])]

print (df)
  class name             marks  min_marks  min_subjects  flag
0     I  tom  [89, 85, 80, 74]         80             2     1
1    II  sam  [65, 72, 43, 40]         85             1     0

Upvotes: 3

Related Questions