Andrea Barnabò
Andrea Barnabò

Reputation: 574

Pandas keep highest value in every n consecutive rows

I have a pandas dataframe called df_initial with two columns 'a' and 'b' and N rows.

I would like to half the rows number, deleting the row where the value of 'b' is lower.

Thus between row 0 and row 1 I will keep row 1, between row 2 and row 3 I will keep row 3 etc..

This is the result that I would like to obtain:

print(df_initial)
         a     b
0     0.04  0.01
1     0.05  0.22
2     0.06  0.34
3     0.07  0.49
4     0.08  0.71
5     0.09  0.09
6     0.10  0.98
7     0.11  0.42
8     0.12  1.32
9     0.13  0.39
10    0.14  0.97
11    0.15  0.05
12    0.16  0.36
13    0.17  1.72
....

print(df_reduced)
         a     b
0     0.05  0.22
1     0.07  0.49
2     0.08  0.71
3     0.10  0.98
4     0.12  1.32
5     0.14  0.97
6     0.17  1.72
....

Is there some Pandas function to do this ?

I saw that there is a resample function, DataFrame.resample() , but it is valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, so not in this case.

Thanks who will help me

Upvotes: 2

Views: 3677

Answers (2)

yatu
yatu

Reputation: 88305

You can groupby every two rows (a simple way of doing so is taking the floor division of the index) and take the idxmax of column b to index the dataframe:

df.loc[df.groupby(df.index//2).b.idxmax(), :]

    a     b
0  0.05  0.22
1  0.07  0.49
2  0.09  0.71
3  0.11  0.98
4  0.13  1.32
5  0.15  0.97
6  0.17  1.72

Or using DataFrame.rolling:

df.loc[df.b.rolling(2).max()[1::2].index, :]

Upvotes: 4

Oussama Bamaàrouf
Oussama Bamaàrouf

Reputation: 149

This is an application for a simple example, you can apply it on your base.

import numpy as np
import pandas as pd
ar = np.array([[1.1, 1.0], [3.3, 0.2], [2.7, 10],[ 5.4, 7], [5.3, 9],[ 1.5, 15]])
df = pd.DataFrame(ar, columns = ['a', 'b'])



for i in range(len(df)):
    if df['b'][i] < df['a'][i]:
        df = df.drop(index = i)
print(df)````

Upvotes: 2

Related Questions