Outcast

Reputation: 5117

Add different missing dates for groups of rows

Let's suppose that I have a dataset which consists of the following columns: Stock_id, Date and Stock_value.

I have some dates, different for each stock, which are entirely missing from the dataset, and I would like to fill them in.

By missing dates, I mean that there is not even a row for these dates; it is not that the rows exist in the dataset with Stock_value simply set to NA.

A limitation is that some stocks were introduced to the stock market sometime in 2018, so obviously I do not want to fill in dates for these stocks before they existed.

By this I mean that if a stock was introduced to the stock market on 21/05/2018, then I want to fill in any missing dates for this stock from 21/05/2018 to 31/12/2018, but no dates before 21/05/2018.

What is the most efficient way to do this?

I have seen some posts on StackOverflow (post_1, post_2 etc.), but I think that my case is more specialised, so I would like to see an efficient way to do this.

Let me provide an example, limited to two stocks and to the week from 01/01/2018 to 07/01/2018 so that it fits here.

Let's suppose that I initially have the following:

Stock_id    Date    Stock_value
1   01/01/2018  124
1   02/01/2018  130
1   03/01/2018  136
1   05/01/2018  129
1   06/01/2018  131
1   07/01/2018  133
2   03/01/2018  144
2   04/01/2018  148
2   06/01/2018  150
2   07/01/2018  147
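For reference, this input can be reproduced as a pandas DataFrame like so (a minimal sketch; it assumes the dates are day-first strings, hence dayfirst=True):

import pandas as pd

df = pd.DataFrame({
    "Stock_id": [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "Date": ["01/01/2018", "02/01/2018", "03/01/2018", "05/01/2018",
             "06/01/2018", "07/01/2018",
             "03/01/2018", "04/01/2018", "06/01/2018", "07/01/2018"],
    "Stock_value": [124, 130, 136, 129, 131, 133, 144, 148, 150, 147],
})
# Parse the day-first date strings into proper datetimes
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)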

Thus, for Stock_id = 1 the date 04/01/2018 is missing.

For Stock_id = 2 the date 05/01/2018 is missing, and since the dates for this stock start at 03/01/2018, the dates before that should not be filled in (because the stock was introduced to the stock market on 03/01/2018).

Hence, I would like to have the following as output:

Stock_id    Date    Stock_value
1   01/01/2018  124
1   02/01/2018  130
1   03/01/2018  136
1   04/01/2018  NA
1   05/01/2018  129
1   06/01/2018  131
1   07/01/2018  133
2   03/01/2018  144
2   04/01/2018  148
2   05/01/2018  NA
2   06/01/2018  150
2   07/01/2018  147
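For completeness, the expected frame can be constructed the same way as the input above, which makes it easy to verify a candidate solution (result below is a hypothetical name for whatever a solution returns):

expected = pd.DataFrame({
    "Stock_id": [1]*7 + [2]*5,
    "Date": pd.to_datetime(
        ["01/01/2018", "02/01/2018", "03/01/2018", "04/01/2018",
         "05/01/2018", "06/01/2018", "07/01/2018",
         "03/01/2018", "04/01/2018", "05/01/2018", "06/01/2018", "07/01/2018"],
        dayfirst=True),
    "Stock_value": [124, 130, 136, None, 129, 131, 133,
                    144, 148, None, 150, 147],
})
# compare a candidate `result` against the expectation
pd.testing.assert_frame_equal(result.reset_index(drop=True), expected)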

Upvotes: 0

Views: 125

Answers (2)

jezrael

Reputation: 862641

Use asfreq per group; note that with large data the performance may be problematic:

# Date must be datetime for asfreq; parse the day-first strings if needed
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.asfreq('D'))   # insert the missing daily dates as NaN
        .reset_index()
        )
print (df)
    Stock_id       Date  Stock_value
0          1 2018-01-01        124.0
1          1 2018-01-02        130.0
2          1 2018-01-03        136.0
3          1 2018-01-04          NaN
4          1 2018-01-05        129.0
5          1 2018-01-06        131.0
6          1 2018-01-07        133.0
7          2 2018-01-03        144.0
8          2 2018-01-04        148.0
9          2 2018-01-05          NaN
10         2 2018-01-06        150.0
11         2 2018-01-07        147.0

EDIT:

If you want to extend each group from its minimal datetime up to some fixed maximum datetime, use reindex with date_range:

# name='Date' keeps the index name so reset_index restores the Date column
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(), '2019-02-20',
                                                 name='Date')))
        .reset_index()
        )
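For comparison, a sketch of the same pattern with each group's own maximum date instead of the fixed scalar, which reproduces the asfreq('D') result above:

df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max(),
                                                 name='Date')))
        .reset_index()
        )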

Upvotes: 1

heyu91

Reputation: 1272

# unstack aligns all observed dates across stocks; ffill fills the new gaps
df.set_index(['Date', 'Stock_id']).unstack().ffill().stack().reset_index()
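Note that this only creates rows for dates that already appear for at least one stock, and the forward fill replaces the gaps with the previous value rather than leaving them as NA, so the result differs from the desired output above.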

Upvotes: 0
