Reputation: 11907
I have a pandas dataframe that looks like this,
id start end
0 1 2020-02-01 2020-04-01
1 2 2020-04-01 2020-04-28
I have two additional parameters that are date values say x and y. x and y will be always a first day of the month.
I want to expand the above data frame to the one shown below for x = "2020-01-01" and y = "2020-06-01",
id month status
0 1 2020-01 -1
1 1 2020-02 1
2 1 2020-03 2
3 1 2020-04 2
4 1 2020-05 -1
5 1 2020-06 -1
6 2 2020-01 -1
7 2 2020-02 -1
8 2 2020-03 -1
9 2 2020-04 1
10 2 2020-05 -1
11 2 2020-06 -1
The dataframe expanded such that for each id, there will be additional months_between(x, y) rows made. And a status columns is made and values are filled in such that,
I'm trying to solve this in pandas without looping. The current solution I have is with loops and takes longer to run with huge datasets.
Is there any pandas functions that can help me here?
Thanks @Code Different for the solution. It solves the issue. However there is an extension to the problem where the dataframe can look like this,
id start end
0 1 2020-02-01 2020-02-20
1 1 2020-04-01 2020-05-10
2 2 2020-04-10 2020-04-28
One id can have more than one entry. For the above x and y which is 6 months apart, I want to have 6 rows for each id in the dataframe. The solution currently creates 6 rows for each row in the dataframe. Which is okay but not ideal when dealing with dataframe with millions of ids.
Upvotes: 1
Views: 1497
Reputation: 93141
Make sure the start
and end
columns are of type Timestamp
:
# Explode each month between x and y
x = '2020-01-01'
y = '2020-06-01'
df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month').drop_duplicate(['id', 'month'])
# Determine the status
df['status'] = -1
cond = df['start'] == df['month']
df.loc[cond, 'status'] = 1
cond = (df['start'] < df['month']) & (df['month'] <= df['end'])
df.loc[cond, 'status'] = 2
Upvotes: 1