Reputation: 83
I have a Dataframe df1
that has a bunch of columns like so:
val_1 | val_2 | start | end | val_3 | val_4 | |
---|---|---|---|---|---|---|
0 | 10 | 70 | 1/1/2020 | 3/4/2020 | 10 | 20 |
1 | 20 | 80 | 1/1/2020 | 3/6/2021 | 30 | 40 |
2 | 30 | 90 | 1/1/2020 | 6/4/2021 | 50 | 60 |
3 | 40 | 100 | 12/5/2020 | 7/4/2021 | 70 | 80 |
4 | 89 | 300 | 4/5/2020 | 6/8/2022 | 40 | 10 |
I need to iterate over the rows, and split the cross-year periods into continuous same year ones. The remaining values in the row need to stay the same and maintain their data types like so:
val_1 | val_2 | start | end | val_3 | val_4 | |
---|---|---|---|---|---|---|
0 | 10 | 70 | 1/1/2020 | 3/4/2020 | 10 | 20 |
1 | 20 | 80 | 1/1/2020 | 12/31/2020 | 30 | 40 |
2 | 20 | 80 | 1/1/2021 | 3/6/2021 | 30 | 40 |
3 | 30 | 90 | 1/1/2020 | 12/31/2020 | 50 | 60 |
4 | 30 | 90 | 1/1/2021 | 6/4/2021 | 50 | 60 |
5 | 40 | 100 | 7/5/2021 | 11/17/2021 | 70 | 80 |
6 | 89 | 300 | 4/5/2020 | 12/31/2020 | 40 | 10 |
7 | 89 | 300 | 1/1/2021 | 12/31/2021 | 40 | 10 |
8 | 89 | 300 | 1/1/2021 | 6/8/2022 | 40 | 10 |
Is there a fast and efficient way to do this? I tried iterating over the rows and doing it but I'm having trouble with the indices and appending rows after an index. Also, people have said that's probably not the best idea to edit things that I'm iterating over so I was wondering if there is a better way to do it. Any suggestions will be appreciated. Thank you!
EDIT
If the row spans more than a year, that should break into 3 or more rows, accordingly. I've edited the tables to accurately reflect this. Thank you!
Upvotes: 1
Views: 188
Reputation: 8952
Here's a different approach. Note that I've already converted start
and end
to datetimes, and I didn't bother sorting the resultant DataFrame because I didn't want to assume a specific ordering for your use-case.
import pandas as pd
def jump_to_new_year(df: pd.DataFrame) -> pd.DataFrame:
df["start"] = df["start"].map(lambda t: pd.Timestamp(t.year + 1, 1, 1))
return df
def fill_to_year_end(df: pd.DataFrame) -> pd.DataFrame:
df["end"] = df["start"].map(lambda t: pd.Timestamp(t.year, 12, 31))
return df
def roll_over(df: pd.DataFrame) -> pd.DataFrame:
mask = df.start.dt.year != df.end.dt.year
if all(~mask):
return df
start_df = fill_to_year_end(df[mask].copy())
end_df = roll_over(jump_to_new_year(df[mask].copy()))
return pd.concat([df[~mask], start_df, end_df]).reset_index(drop=True)
This is a recursive function. It first checks if any start-end date pairs have mismatched years. If not, we simply return the DataFrame. If so, we fill to the end of the year in the start_df
DataFrame. Then we jump to the new year and fill that to the end
date in the end_df
DataFrame. Then we recurse on end_df
, which will be a smaller subset of the original input.
Warning: this solution assumes that all start dates occur on or before the end date's year. If you start in 2020 and end in 2019, you will recurse infinitely and blow the stack.
Demo:
>>> df
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2021-07-04 70 80
4 89 300 2020-04-05 2022-06-08 40 10
>>> roll_over(df)
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
2 30 90 2020-01-01 2020-12-31 50 60
3 40 100 2020-12-05 2020-12-31 70 80
4 89 300 2020-04-05 2020-12-31 40 10
5 20 80 2021-01-01 2021-03-06 30 40
6 30 90 2021-01-01 2021-06-04 50 60
7 40 100 2021-01-01 2021-07-04 70 80
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10
# An example of reordering the DataFrame
>>> roll_over(df).sort_values(by=["val_1", "start"])
val_1 val_2 start end val_3 val_4
0 10 70 2020-01-01 2020-03-04 10 20
1 20 80 2020-01-01 2020-12-31 30 40
5 20 80 2021-01-01 2021-03-06 30 40
2 30 90 2020-01-01 2020-12-31 50 60
6 30 90 2021-01-01 2021-06-04 50 60
3 40 100 2020-12-05 2020-12-31 70 80
7 40 100 2021-01-01 2021-07-04 70 80
4 89 300 2020-04-05 2020-12-31 40 10
8 89 300 2021-01-01 2021-12-31 40 10
9 89 300 2022-01-01 2022-06-08 40 10
Upvotes: 2
Reputation: 323226
Find the year end after date_range
, then explode
df['end'] = [[y]+pd.date_range(x,y)[pd.date_range(x,y).is_year_end].strftime('%m/%d/%y').tolist() for x , y in zip(df['start'],df['end'])]
df = df.explode('end')
df
Out[29]:
val_1 val_2 start end val_3 val_4
0 10 70 1/1/2020 3/4/2020 10 20
1 20 80 1/1/2020 3/6/2021 30 40
1 20 80 1/1/2020 12/31/20 30 40
2 30 90 1/1/2020 6/4/2021 50 60
2 30 90 1/1/2020 12/31/20 50 60
3 40 100 12/5/2020 7/4/2021 70 80
3 40 100 12/5/2020 12/31/20 70 80
Update
df.end=pd.to_datetime(df.end)
df.start=pd.to_datetime(df.start)
df['Newstart'] = [list(set([x]+pd.date_range(x,y)[pd.date_range(x,y).is_year_start].tolist()))
for x , y in zip(df['start'],df['end'])]
df['Newend'] = [[y]+pd.date_range(x,y)[pd.date_range(x,y).is_year_end].tolist()
for x , y in zip(df['start'],df['end'])]
out = df.explode(['Newend','Newstart'])
val_1 val_2 start end val_3 val_4 Newstart Newend
0 10 70 2020-01-01 2020-03-04 10 20 2020-01-01 2020-03-04
1 20 80 2020-01-01 2021-03-06 30 40 2021-01-01 2021-03-06
1 20 80 2020-01-01 2021-03-06 30 40 2020-01-01 2020-12-31
2 30 90 2020-01-01 2021-06-04 50 60 2021-01-01 2021-06-04
2 30 90 2020-01-01 2021-06-04 50 60 2020-01-01 2020-12-31
3 40 100 2020-12-05 2021-07-04 70 80 2021-01-01 2021-07-04
3 40 100 2020-12-05 2021-07-04 70 80 2020-12-05 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2020-04-05 2022-06-08
4 89 300 2020-04-05 2022-06-08 40 10 2022-01-01 2020-12-31
4 89 300 2020-04-05 2022-06-08 40 10 2021-01-01 2021-12-31
Upvotes: 1