Reputation: 75
I have a pandas dataframe (subset constructed as below):
df1 = pd.DataFrame(data = {'Id': [12345678, 23456789, 34567890, 45678901, 56789012],
'score': [0.0123, 0.0145, 0.0345, 0.5660, 1.0000],
'target': [0, 0, 0, 1, 1]})
Id score target
0 12345678 0.0123 0
1 23456789 0.0145 0
2 34567890 0.0345 0
3 45678901 0.5660 1
4 56789012 1.0000 1
I want to add one row between each row if the score difference is 0.01 or larger. The score for the new row should then be the mean of the two compared rows and the ID and target should stay the same as the first row. The desired output is:
Id score target
0 12345678 0.0123 0
1 23456789 0.0145 0
2 23456789 0.0245 0
3 34567890 0.0345 0
4 34567890 0.30025 0
5 45678901 0.5660 1
6 45678901 0.7830 1
7 56789012 1.0000 1
I only know how to replicate/duplicate specific rows but not sure how to do this with condition and if to somehow use apply or not.
Any ideas? Thanks a lot!
Upvotes: 2
Views: 133
Reputation: 260420
You can apply rolling
+mean
to get the average of pairs of rows.
Then select the rows to keep from the initial difference between the scores.
Finally, concatenate the two dataframes and reorder.
(pd.concat([df1,
(df1.assign(score=df1['score'].rolling(2).mean().shift(-1))
.loc[df1['score'].diff(-1).abs().gt(0.01)]
)]
)
.sort_index()
.reset_index(drop=True)
)
output:
Id score target
0 12345678 0.01230 0
1 23456789 0.01450 0
2 23456789 0.02450 0
3 34567890 0.03450 0
4 34567890 0.30025 0
5 45678901 0.56600 1
6 45678901 0.78300 1
7 56789012 1.00000 1
Upvotes: 2