Reputation: 5117
Let's suppose that I have a dataset which consists of the following columns:
Stock_id: the id of a stock
Date: a date in 2018, e.g. 25/03/2018
Stock_value: the value of the stock on this specific date
I have some dates, different for each stock, which are entirely missing from the dataset and I would like to fill them in.
By missing dates, I mean that there is not even a row for these dates; not that the rows exist in the dataset and the Stock_value there is simply NA.
A limitation is that some stocks were introduced to the stock market at some point during 2018, so I do not want to fill in dates for these stocks from before they existed.
By this I mean that if a stock was introduced to the stock market on 21/05/2018, then I want to fill in any missing dates for this stock from 21/05/2018 to 31/12/2018, but not dates before 21/05/2018.
What is the most efficient way to do this?
I have seen some posts on StackOverflow (post_1, post_2 etc.) but I think that my case is a more special one, so I would like to see an efficient way to do this.
Let me provide an example. Let's limit this to only two stocks and only to the week from 01/01/2018 to 07/01/2018, otherwise it won't fit in here.
Let's suppose that I initially have the following:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 06/01/2018 150
2 07/01/2018 147
Thus for Stock_id = 1 the date 04/01/2018 is missing.
For Stock_id = 2 the date 05/01/2018 is missing and, since the dates for this stock start at 03/01/2018, the dates before that should not be filled in (because the stock was introduced to the stock market on 03/01/2018).
Hence, I would like to have the following as output:
Stock_id Date Stock_value
1 01/01/2018 124
1 02/01/2018 130
1 03/01/2018 136
1 04/01/2018 NA
1 05/01/2018 129
1 06/01/2018 131
1 07/01/2018 133
2 03/01/2018 144
2 04/01/2018 148
2 05/01/2018 NA
2 06/01/2018 150
2 07/01/2018 147
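For reference, the sample input above could be built as a pandas DataFrame roughly like this (just a sketch to make the example reproducible; in reality the data comes from my dataset):
import pandas as pd

# sample input from the example above; dates are day-first (DD/MM/YYYY)
df = pd.DataFrame({
    'Stock_id': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'Date': ['01/01/2018', '02/01/2018', '03/01/2018', '05/01/2018',
             '06/01/2018', '07/01/2018', '03/01/2018', '04/01/2018',
             '06/01/2018', '07/01/2018'],
    'Stock_value': [124, 130, 136, 129, 131, 133, 144, 148, 150, 147],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')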
Upvotes: 0
Views: 125
Reputation: 862641
Use asfreq
per group, but with large data the performance may be problematic:
# Date needs to be a datetime column (e.g. via pd.to_datetime) and sorted within each group
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.asfreq('D'))   # asfreq('D') adds a row for every missing day with NaN
        .reset_index()
     )
print(df)
Stock_id Date Stock_value
0 1 2018-01-01 124.0
1 1 2018-01-02 130.0
2 1 2018-01-03 136.0
3 1 2018-01-04 NaN
4 1 2018-01-05 129.0
5 1 2018-01-06 131.0
6 1 2018-01-07 133.0
7 2 2018-01-03 144.0
8 2 2018-01-04 148.0
9 2 2018-01-05 NaN
10 2 2018-01-06 150.0
11 2 2018-01-07 147.0
EDIT:
If you want each group to run from its minimal datetime up to some fixed maximum datetime, use reindex
with date_range
:
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(), '2019-02-20')))
        .reset_index()
     )
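A sketch of the same pattern for the question's full-year case (the end date 31/12/2018 is taken from the question; passing name='Date' to date_range only keeps the column label after reset_index):
# extend each stock from its first quoted date to the end of 2018
df = (df.set_index('Date')
        .groupby('Stock_id')['Stock_value']
        .apply(lambda x: x.reindex(pd.date_range(x.index.min(),
                                                 '2018-12-31',
                                                 name='Date')))
        .reset_index()
     )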
Upvotes: 1
Reputation: 1272
# full Date x Stock_id grid via unstack; gaps are forward-filled from the previous day's value (not left as NaN)
df.set_index(['Date', 'Stock_id']).unstack().fillna(method='ffill').stack().reset_index()
Upvotes: 0