mrcw
mrcw

Reputation: 75

Pandas - duplicate row on condition and increase only one value in row

I have a pandas dataframe (subset constructed as below):

df1 = pd.DataFrame(data = {'Id': [12345678, 23456789, 34567890, 45678901, 56789012],
                           'score': [0.0123, 0.0145, 0.0345, 0.5660, 1.0000],
                           'target': [0, 0, 0, 1, 1]})
    Id          score    target
0   12345678    0.0123      0
1   23456789    0.0145      0
2   34567890    0.0345      0
3   45678901    0.5660      1
4   56789012    1.0000      1

I want to add one row between each row if the score difference is 0.01 or larger. The score for the new row should then be the mean of the two compared rows and the ID and target should stay the same as the first row. The desired output is:

    Id          score    target
0   12345678    0.0123      0
1   23456789    0.0145      0
2   23456789    0.0245      0
3   34567890    0.0345      0
4   34567890    0.30025     0
5   45678901    0.5660      1
6   45678901    0.7830      1
7   56789012    1.0000      1

I only know how to replicate/duplicate specific rows but not sure how to do this with condition and if to somehow use apply or not.

Any ideas? Thanks a lot!

Upvotes: 2

Views: 133

Answers (1)

mozway
mozway

Reputation: 260420

You can apply rolling+mean to get the average of pairs of rows.

Then select the rows to keep from the initial difference between the scores.

Finally, concatenate the two dataframes and reorder.

(pd.concat([df1,
            (df1.assign(score=df1['score'].rolling(2).mean().shift(-1))
                .loc[df1['score'].diff(-1).abs().gt(0.01)]
             )]
          )
   .sort_index()
   .reset_index(drop=True)
)

output:

         Id    score  target
0  12345678  0.01230       0
1  23456789  0.01450       0
2  23456789  0.02450       0
3  34567890  0.03450       0
4  34567890  0.30025       0
5  45678901  0.56600       1
6  45678901  0.78300       1
7  56789012  1.00000       1

Upvotes: 2

Related Questions