Filling Pandas dataframe based on date columns and date range

Question

I have a pandas dataframe that looks like this,

    id     start    end
0   1   2020-02-01  2020-04-01
1   2   2020-04-01  2020-04-28

I have two additional parameters that are date values say x and y. x and y will be always a first day of the month.

I want to expand the above data frame to the one shown below for x = "2020-01-01" and y = "2020-06-01",

    id  month   status
0   1   2020-01 -1
1   1   2020-02 1
2   1   2020-03 2
3   1   2020-04 2
4   1   2020-05 -1
5   1   2020-06 -1
6   2   2020-01 -1
7   2   2020-02 -1
8   2   2020-03 -1
9   2   2020-04 1
10  2   2020-05 -1
11  2   2020-06 -1

The dataframe expanded such that for each id, there will be additional months_between(x, y) rows made. And a status columns is made and values are filled in such that,

If the month column value is equal to month of start column then fill status as 1
If the month column value is greater than month of start column but less than or equal to month of end column fill it as 2.
If the month column value is less than month of start month then fill it as -1. Also if the month column value is greater than month of end fill status with -1.

I'm trying to solve this in pandas without looping. The current solution I have is with loops and takes longer to run with huge datasets.

Is there any pandas functions that can help me here?

Thanks @Code Different for the solution. It solves the issue. However there is an extension to the problem where the dataframe can look like this,

    id     start       end
0   1   2020-02-01  2020-02-20
1   1   2020-04-01  2020-05-10
2   2   2020-04-10  2020-04-28

One id can have more than one entry. For the above x and y which is 6 months apart, I want to have 6 rows for each id in the dataframe. The solution currently creates 6 rows for each row in the dataframe. Which is okay but not ideal when dealing with dataframe with millions of ids.

Code Different · Accepted Answer

Make sure the start and end columns are of type Timestamp:

# Explode each month between x and y
x = '2020-01-01'
y = '2020-06-01'

df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month').drop_duplicate(['id', 'month'])

# Determine the status
df['status'] = -1

cond = df['start'] == df['month']
df.loc[cond, 'status'] = 1

cond = (df['start'] < df['month']) & (df['month'] <= df['end'])
df.loc[cond, 'status'] = 2

Filling Pandas dataframe based on date columns and date range

Answers (1)

Related Questions