Sreeram TP
Sreeram TP

Reputation: 11907

Filling Pandas dataframe based on date columns and date range

I have a pandas dataframe that looks like this,

    id     start    end
0   1   2020-02-01  2020-04-01
1   2   2020-04-01  2020-04-28

I have two additional parameters that are date values say x and y. x and y will be always a first day of the month.

I want to expand the above data frame to the one shown below for x = "2020-01-01" and y = "2020-06-01",

    id  month   status
0   1   2020-01 -1
1   1   2020-02 1
2   1   2020-03 2
3   1   2020-04 2
4   1   2020-05 -1
5   1   2020-06 -1
6   2   2020-01 -1
7   2   2020-02 -1
8   2   2020-03 -1
9   2   2020-04 1
10  2   2020-05 -1
11  2   2020-06 -1

The dataframe expanded such that for each id, there will be additional months_between(x, y) rows made. And a status columns is made and values are filled in such that,

I'm trying to solve this in pandas without looping. The current solution I have is with loops and takes longer to run with huge datasets.

Is there any pandas functions that can help me here?

Thanks @Code Different for the solution. It solves the issue. However there is an extension to the problem where the dataframe can look like this,

    id     start       end
0   1   2020-02-01  2020-02-20
1   1   2020-04-01  2020-05-10
2   2   2020-04-10  2020-04-28

One id can have more than one entry. For the above x and y which is 6 months apart, I want to have 6 rows for each id in the dataframe. The solution currently creates 6 rows for each row in the dataframe. Which is okay but not ideal when dealing with dataframe with millions of ids.

Upvotes: 1

Views: 1497

Answers (1)

Code Different
Code Different

Reputation: 93141

Make sure the start and end columns are of type Timestamp:

# Explode each month between x and y
x = '2020-01-01'
y = '2020-06-01'

df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month').drop_duplicate(['id', 'month'])

# Determine the status
df['status'] = -1

cond = df['start'] == df['month']
df.loc[cond, 'status'] = 1

cond = (df['start'] < df['month']) & (df['month'] <= df['end'])
df.loc[cond, 'status'] = 2

Upvotes: 1

Related Questions