R__
R__

Reputation: 117

How to duplicate a row if a condition is matched?

I have the following dataset

    trade_id    start_date  end_date    factset_entity_id   product_id  product_name    l6_id   multi_assign_flag
0   100000191   2017    NaT 0711GY-E    99996362    Fan Milk - FanYogo  5.015152e+11    0.0
1   100000193   2017    2018    0711GY-E    99996413    Fan Milk - FanDango/Frozen FanDango 5.015151e+11    0.0
2   100000193   2018    2022    0711GY-E    99996413    Fan Milk - FanDango 5.015151e+11    0.0
3   100000193   2022    NaT 0711GY-E    99996413    Fan Milk - FanDango 5.015151e+11    0.0
4   100000231   2016    2018    05NC3S-E    59979529    Ci:z Holdings - Dr. Ci:Labo 5.020102e+11    1.0

My goal is to duplicate the rows in which start_date and end_date are different, for instance

1   100000193   2017    2018    0711GY-E    99996413    Fan Milk - FanDango/Frozen FanDango 5.015151e+11    0.0

should be

2017    100000193   0711GY-E    99996413    Fan Milk - FanDango/Frozen FanDango 5.015151e+11    0.0
2018    100000193   0711GY-E    99996413    Fan Milk - FanDango/Frozen FanDango 5.015151e+11    0.0

Do you have any idea about what could I do here without using loops? Thanks

I try to do it using simple loops and it's fine, but it's slow. I wonder if there are shortcuts in pandas that ease it.

Upvotes: 1

Views: 41

Answers (1)

Laurent
Laurent

Reputation: 13458

With the following toy dataframe in which rows 0 and 2 have different start and end dates:

import pandas as pd

df = pd.DataFrame(
    {
        "trade_id": [100000191, 100000192, 100000193, 100000194, 100000231],
        "start_date": [2017, 2017, 2018, 2022, 2018],
        "end_date": [None, 2017, 2022, 2022, 2018],
        "factset_entity_id": [
            "0711GY-E",
            "0711GY-E",
            "0711GY-E",
            "0711GY-E",
            "05NC3S-E",
        ],
        "product_id": [99996362, 99996413, 99996414, 99996415, 59979529],
    }
)

Here is one way to do it using pd.concat:

df = pd.concat([df, df.loc[df["start_date"] != df["end_date"], :]]).sort_index()

Then, you can see that rows 0 and 2 are now duplicated:

print(df)
# Output
    trade_id  start_date  end_date factset_entity_id  product_id
0  100000191        2017       NaN          0711GY-E    99996362
0  100000191        2017       NaN          0711GY-E    99996362
1  100000192        2017    2017.0          0711GY-E    99996413
2  100000193        2018    2022.0          0711GY-E    99996414
2  100000193        2018    2022.0          0711GY-E    99996414
3  100000194        2022    2022.0          0711GY-E    99996415
4  100000231        2018    2018.0          05NC3S-E    59979529

Upvotes: 1

Related Questions