Reputation: 574
I have a pandas dataframe called df_initial
with two columns 'a' and 'b' and N rows.
I would like to half the rows number, deleting the row where the value of 'b' is lower.
Thus between row 0 and row 1 I will keep row 1, between row 2 and row 3 I will keep row 3 etc..
This is the result that I would like to obtain:
print(df_initial)
a b
0 0.04 0.01
1 0.05 0.22
2 0.06 0.34
3 0.07 0.49
4 0.08 0.71
5 0.09 0.09
6 0.10 0.98
7 0.11 0.42
8 0.12 1.32
9 0.13 0.39
10 0.14 0.97
11 0.15 0.05
12 0.16 0.36
13 0.17 1.72
....
print(df_reduced)
a b
0 0.05 0.22
1 0.07 0.49
2 0.08 0.71
3 0.10 0.98
4 0.12 1.32
5 0.14 0.97
6 0.17 1.72
....
Is there some Pandas function to do this ?
I saw that there is a resample function, DataFrame.resample()
, but it is valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, so not in this case.
Thanks who will help me
Upvotes: 2
Views: 3677
Reputation: 88305
You can groupby
every two rows (a simple way of doing so is taking the floor division of the index) and take the idxmax
of column b
to index the dataframe:
df.loc[df.groupby(df.index//2).b.idxmax(), :]
a b
0 0.05 0.22
1 0.07 0.49
2 0.09 0.71
3 0.11 0.98
4 0.13 1.32
5 0.15 0.97
6 0.17 1.72
Or using DataFrame.rolling
:
df.loc[df.b.rolling(2).max()[1::2].index, :]
Upvotes: 4
Reputation: 149
This is an application for a simple example, you can apply it on your base.
import numpy as np
import pandas as pd
ar = np.array([[1.1, 1.0], [3.3, 0.2], [2.7, 10],[ 5.4, 7], [5.3, 9],[ 1.5, 15]])
df = pd.DataFrame(ar, columns = ['a', 'b'])
for i in range(len(df)):
if df['b'][i] < df['a'][i]:
df = df.drop(index = i)
print(df)````
Upvotes: 2