Reputation: 18367
I am currently working on some json files that contains certain a data key with values ranging from 2018-07-05 to 2019-02-05 and another key (balance) which I'm using as column. I'm trying to understand how to fill or replace these dates with daily values so that when the date exists, the balance column has its corresponding value, otherwise I get NaN (which then I will fill with bfill, but I know how to do that already).
This is my code to access the json file:
x = 'C:\\Users\\ivan.libedinsky\\Downloads\\example.json'
with open(x, encoding="Utf-8") as w:
data = json.load(w)
d = {'onDate':[],'balance':[]}
for i in data['accountList'][0]['transactionList']:
d['onDate'].append(i['onDate'])
d['balance'].append(i['balance'])
df = pd.DataFrame(d)
df1 = df.groupby('onDate').balance.first()
And the output I get is:
onDate balance
0 2018-07-05 3.73
1 2018-08-01 378.49
2 2018-08-03 328.49
3 2018-08-05 128.49
4 2018-08-06 78.49
.. ... ...
What I'm trying to achieve but I'm not sure how is:
onDate balance
0 2018-07-05 3.73
1 2018-07-06 NaN
2 2018-07-07 NaN
3 2018-07-08 NaN
4 2018-07-09 NaN
.. ... ...
27 2018-08-01 378.49
.. ... ...
So that when I do the backfill (which I have no trouble with) my end result is:
onDate balance
0 2018-07-05 3.73
1 2018-07-06 3.73
2 2018-07-07 3.73
3 2018-07-08 3.73
4 2018-07-09 3.73
.. ... ...
27 2018-08-01 378.49
.. ... ...
My first thought is to create another list which then I will use it to do a left join with the dataframe.
from datetime import datetime, timedelta
date = min(ondate)
while max(ondate) > max(date):
date.append(date + timedelta(days=1)
But I'm not sure if this is optimal and scalable. Thanks in advance.
Upvotes: 2
Views: 199
Reputation: 33773
Use DataFrame.resample
:
# ensure 'onDate' column has been converted to timestamps
df['onDate'] = pd.to_datetime(df['onDate'])
# resample, setting/resetting index as appropriate
df = df.set_index('onDate')
df = df.resample('D').ffill()
df = df.reset_index()
Or alternatively DataFrame.reindex
may be a little more generic if you want to explicitly include start/end dates outside the existing range of dates:
# ensure 'onDate' is of the proper dtype
df['onDate'] = pd.to_datetime(df['onDate'])
# reindex, setting/resetting index as appropriate
full_idx = pd.date_range('2018-07-05', '2018-08-06', name='onDate')
df = df.set_index('onDate')
df = df.reindex(full_idx, method='ffill')
df = df.reset_index()
The resulting output for either method:
onDate balance
0 2018-07-05 3.73
1 2018-07-06 3.73
2 2018-07-07 3.73
3 2018-07-08 3.73
4 2018-07-09 3.73
5 2018-07-10 3.73
6 2018-07-11 3.73
7 2018-07-12 3.73
8 2018-07-13 3.73
9 2018-07-14 3.73
10 2018-07-15 3.73
11 2018-07-16 3.73
12 2018-07-17 3.73
13 2018-07-18 3.73
14 2018-07-19 3.73
15 2018-07-20 3.73
16 2018-07-21 3.73
17 2018-07-22 3.73
18 2018-07-23 3.73
19 2018-07-24 3.73
20 2018-07-25 3.73
21 2018-07-26 3.73
22 2018-07-27 3.73
23 2018-07-28 3.73
24 2018-07-29 3.73
25 2018-07-30 3.73
26 2018-07-31 3.73
27 2018-08-01 378.49
28 2018-08-02 378.49
29 2018-08-03 328.49
30 2018-08-04 328.49
31 2018-08-05 128.49
32 2018-08-06 78.49
Upvotes: 1