karanrajpal14
karanrajpal14

Reputation: 83

Splitting DataFrame rows into multiple based on a condition

I have a Dataframe df1 that has a bunch of columns like so:

val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 3/6/2021 30 40
2 30 90 1/1/2020 6/4/2021 50 60
3 40 100 12/5/2020 7/4/2021 70 80
4 89 300 4/5/2020 6/8/2022 40 10

I need to iterate over the rows, and split the cross-year periods into continuous same year ones. The remaining values in the row need to stay the same and maintain their data types like so:

val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 12/31/2020 30 40
2 20 80 1/1/2021 3/6/2021 30 40
3 30 90 1/1/2020 12/31/2020 50 60
4 30 90 1/1/2021 6/4/2021 50 60
5 40 100 7/5/2021 11/17/2021 70 80
6 89 300 4/5/2020 12/31/2020 40 10
7 89 300 1/1/2021 12/31/2021 40 10
8 89 300 1/1/2021 6/8/2022 40 10

Is there a fast and efficient way to do this? I tried iterating over the rows and doing it but I'm having trouble with the indices and appending rows after an index. Also, people have said that's probably not the best idea to edit things that I'm iterating over so I was wondering if there is a better way to do it. Any suggestions will be appreciated. Thank you!

EDIT

If the row spans more than a year, that should break into 3 or more rows, accordingly. I've edited the tables to accurately reflect this. Thank you!

Upvotes: 1

Views: 188

Answers (2)

ddejohn
ddejohn

Reputation: 8952

Here's a different approach. Note that I've already converted start and end to datetimes, and I didn't bother sorting the resultant DataFrame because I didn't want to assume a specific ordering for your use-case.

import pandas as pd


def jump_to_new_year(df: pd.DataFrame) -> pd.DataFrame:
    df["start"] = df["start"].map(lambda t: pd.Timestamp(t.year + 1, 1, 1))
    return df


def fill_to_year_end(df: pd.DataFrame) -> pd.DataFrame:
    df["end"] = df["start"].map(lambda t: pd.Timestamp(t.year, 12, 31))
    return df


def roll_over(df: pd.DataFrame) -> pd.DataFrame:
    mask = df.start.dt.year != df.end.dt.year
    if all(~mask):
        return df
    start_df = fill_to_year_end(df[mask].copy())
    end_df = roll_over(jump_to_new_year(df[mask].copy()))
    return pd.concat([df[~mask], start_df, end_df]).reset_index(drop=True)

This is a recursive function. It first checks if any start-end date pairs have mismatched years. If not, we simply return the DataFrame. If so, we fill to the end of the year in the start_df DataFrame. Then we jump to the new year and fill that to the end date in the end_df DataFrame. Then we recurse on end_df, which will be a smaller subset of the original input.

Warning: this solution assumes that all start dates occur on or before the end date's year. If you start in 2020 and end in 2019, you will recurse infinitely and blow the stack.

Demo:

>>> df
   val_1  val_2      start        end  val_3  val_4
0     10     70 2020-01-01 2020-03-04     10     20
1     20     80 2020-01-01 2021-03-06     30     40
2     30     90 2020-01-01 2021-06-04     50     60
3     40    100 2020-12-05 2021-07-04     70     80
4     89    300 2020-04-05 2022-06-08     40     10

>>> roll_over(df)
   val_1  val_2      start        end  val_3  val_4
0     10     70 2020-01-01 2020-03-04     10     20
1     20     80 2020-01-01 2020-12-31     30     40
2     30     90 2020-01-01 2020-12-31     50     60
3     40    100 2020-12-05 2020-12-31     70     80
4     89    300 2020-04-05 2020-12-31     40     10
5     20     80 2021-01-01 2021-03-06     30     40
6     30     90 2021-01-01 2021-06-04     50     60
7     40    100 2021-01-01 2021-07-04     70     80
8     89    300 2021-01-01 2021-12-31     40     10
9     89    300 2022-01-01 2022-06-08     40     10

# An example of reordering the DataFrame
>>> roll_over(df).sort_values(by=["val_1", "start"])
   val_1  val_2      start        end  val_3  val_4
0     10     70 2020-01-01 2020-03-04     10     20
1     20     80 2020-01-01 2020-12-31     30     40
5     20     80 2021-01-01 2021-03-06     30     40
2     30     90 2020-01-01 2020-12-31     50     60
6     30     90 2021-01-01 2021-06-04     50     60
3     40    100 2020-12-05 2020-12-31     70     80
7     40    100 2021-01-01 2021-07-04     70     80
4     89    300 2020-04-05 2020-12-31     40     10
8     89    300 2021-01-01 2021-12-31     40     10
9     89    300 2022-01-01 2022-06-08     40     10

Upvotes: 2

BENY
BENY

Reputation: 323226

Find the year end after date_range, then explode

df['end'] = [[y]+pd.date_range(x,y)[pd.date_range(x,y).is_year_end].strftime('%m/%d/%y').tolist() for x , y in zip(df['start'],df['end'])]
df = df.explode('end')
df
Out[29]: 
   val_1  val_2      start       end  val_3  val_4
0     10     70   1/1/2020  3/4/2020     10     20
1     20     80   1/1/2020  3/6/2021     30     40
1     20     80   1/1/2020  12/31/20     30     40
2     30     90   1/1/2020  6/4/2021     50     60
2     30     90   1/1/2020  12/31/20     50     60
3     40    100  12/5/2020  7/4/2021     70     80
3     40    100  12/5/2020  12/31/20     70     80

Update

df.end=pd.to_datetime(df.end)
df.start=pd.to_datetime(df.start)
df['Newstart'] = [list(set([x]+pd.date_range(x,y)[pd.date_range(x,y).is_year_start].tolist()))
                  for x , y in zip(df['start'],df['end'])]
df['Newend'] = [[y]+pd.date_range(x,y)[pd.date_range(x,y).is_year_end].tolist()
                  for x , y in zip(df['start'],df['end'])]

out = df.explode(['Newend','Newstart'])
val_1  val_2      start        end  val_3  val_4   Newstart     Newend
0     10     70 2020-01-01 2020-03-04     10     20 2020-01-01 2020-03-04
1     20     80 2020-01-01 2021-03-06     30     40 2021-01-01 2021-03-06
1     20     80 2020-01-01 2021-03-06     30     40 2020-01-01 2020-12-31
2     30     90 2020-01-01 2021-06-04     50     60 2021-01-01 2021-06-04
2     30     90 2020-01-01 2021-06-04     50     60 2020-01-01 2020-12-31
3     40    100 2020-12-05 2021-07-04     70     80 2021-01-01 2021-07-04
3     40    100 2020-12-05 2021-07-04     70     80 2020-12-05 2020-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2020-04-05 2022-06-08
4     89    300 2020-04-05 2022-06-08     40     10 2022-01-01 2020-12-31
4     89    300 2020-04-05 2022-06-08     40     10 2021-01-01 2021-12-31

Upvotes: 1

Related Questions