Stan
Stan

Reputation: 884

Maximal Subset of Pandas Column based on a CutoFF

I am having an algoritmic problem which I am trying to solve in python. I have a pandas dataframe ( say) of two columns as: ( I have it kept it sorted in descending here to make it easier to explain the problem)

  df:
           ACOL       BCOL  

           LA1        234
           LA2        230
           LA3        220
           LA4        218
           LA5        210
           LA6        200
           LA7        185
           LA8        180
           LA9        150
           LA10       100

I have a threshold value of BCOL, say 215. So what I want is to get the maximal subset from the above pandas dataframe, which when I take the average of BCOL will give me greater than or equal to 215.

So in this case, if I keep the BCOL values upto 200 then the mean of (234, 230,... 200) is 218.67, whereas if I keep up to 185 ( 234, 230, ..., 200, 185), the mean is 213.86. So my maximal subset to get the BCOL mean greater than 215 should be from ( 234,... 200). So I will drop the rest of the rows. So my final output pandas dataframe should be :

 dfnew:
           ACOL       BCOL  

           LA1        234
           LA2        230
           LA3        220
           LA4        218
           LA5        210
           LA6        200

I was trying to put the BCOL into a list and trying a for/while loop, but it is not pythonic and also a bit time consuming for very large data table. Is there a way in pandas to achieve this more pythonic way.

Will appreciate any help. Thanks.

Upvotes: 0

Views: 176

Answers (1)

Dani Mesejo
Dani Mesejo

Reputation: 61910

IIUC, you could do:

# guarantee that the DF is sorted by non ascending
df = df.sort_values(by=['BCOL'], ascending=False)

# cumulative mean, then find where is gt 215
mask = (df['BCOL'].cumsum() / np.arange(1, len(df) + 1)) > 215.0

print(df[mask])

Output

  ACOL  BCOL
0  LA1   234
1  LA2   230
2  LA3   220
3  LA4   218
4  LA5   210
5  LA6   200

Upvotes: 1

Related Questions