Celius Stingher
Celius Stingher

Reputation: 18367

Fill index with daily values between an interval

I am currently working on some json files that contains certain a data key with values ranging from 2018-07-05 to 2019-02-05 and another key (balance) which I'm using as column. I'm trying to understand how to fill or replace these dates with daily values so that when the date exists, the balance column has its corresponding value, otherwise I get NaN (which then I will fill with bfill, but I know how to do that already).

This is my code to access the json file:

x = 'C:\\Users\\ivan.libedinsky\\Downloads\\example.json'
with open(x, encoding="Utf-8") as w:
    data = json.load(w)
    d = {'onDate':[],'balance':[]}
    for i in data['accountList'][0]['transactionList']:
        d['onDate'].append(i['onDate'])  
        d['balance'].append(i['balance'])
df = pd.DataFrame(d)
df1 = df.groupby('onDate').balance.first()

And the output I get is:

        onDate  balance
0   2018-07-05     3.73
1   2018-08-01   378.49
2   2018-08-03   328.49
3   2018-08-05   128.49
4   2018-08-06    78.49
..         ...      ...

What I'm trying to achieve but I'm not sure how is:

        onDate  balance
0   2018-07-05     3.73
1   2018-07-06      NaN
2   2018-07-07      NaN
3   2018-07-08      NaN
4   2018-07-09      NaN
..         ...      ...
27  2018-08-01    378.49 
..         ...      ...

So that when I do the backfill (which I have no trouble with) my end result is:

        onDate  balance
0   2018-07-05     3.73
1   2018-07-06     3.73
2   2018-07-07     3.73
3   2018-07-08     3.73
4   2018-07-09     3.73
..         ...      ...
27  2018-08-01    378.49 
..         ...      ...

My first thought is to create another list which then I will use it to do a left join with the dataframe.

from datetime import datetime, timedelta
date = min(ondate)
while max(ondate) > max(date):
    date.append(date + timedelta(days=1)

But I'm not sure if this is optimal and scalable. Thanks in advance.

Upvotes: 2

Views: 199

Answers (1)

root
root

Reputation: 33773

Use DataFrame.resample:

# ensure 'onDate' column has been converted to timestamps
df['onDate'] = pd.to_datetime(df['onDate'])

# resample, setting/resetting index as appropriate
df = df.set_index('onDate')
df = df.resample('D').ffill()
df = df.reset_index()

Or alternatively DataFrame.reindex may be a little more generic if you want to explicitly include start/end dates outside the existing range of dates:

# ensure 'onDate' is of the proper dtype
df['onDate'] = pd.to_datetime(df['onDate'])

# reindex, setting/resetting index as appropriate
full_idx = pd.date_range('2018-07-05', '2018-08-06', name='onDate')
df = df.set_index('onDate')
df = df.reindex(full_idx, method='ffill')
df = df.reset_index()

The resulting output for either method:

       onDate  balance
0  2018-07-05     3.73
1  2018-07-06     3.73
2  2018-07-07     3.73
3  2018-07-08     3.73
4  2018-07-09     3.73
5  2018-07-10     3.73
6  2018-07-11     3.73
7  2018-07-12     3.73
8  2018-07-13     3.73
9  2018-07-14     3.73
10 2018-07-15     3.73
11 2018-07-16     3.73
12 2018-07-17     3.73
13 2018-07-18     3.73
14 2018-07-19     3.73
15 2018-07-20     3.73
16 2018-07-21     3.73
17 2018-07-22     3.73
18 2018-07-23     3.73
19 2018-07-24     3.73
20 2018-07-25     3.73
21 2018-07-26     3.73
22 2018-07-27     3.73
23 2018-07-28     3.73
24 2018-07-29     3.73
25 2018-07-30     3.73
26 2018-07-31     3.73
27 2018-08-01   378.49
28 2018-08-02   378.49
29 2018-08-03   328.49
30 2018-08-04   328.49
31 2018-08-05   128.49
32 2018-08-06    78.49

Upvotes: 1

Related Questions