Reputation: 788
My pandas DataFrame looks like the one below:
country        date   gd
US       01-01-2014    2
US       01-01-2015    3
US       01-01-2013  0.4
UK       01-01-2000  0.7
UK       02-01-2001  0.5
UK       01-01-2016    1
What I want to do is:
1) Fill in all dates (daily), starting from each country's minimum date up to today; so for the US that is 01-01-2013 up to today, and for the UK it is 01-01-2000 up to today.
2) Fill the gd column with the previously available value (forward fill).
Many thanks for your help.
Upvotes: 1
Views: 723
Reputation: 879103
You could make date the index and then use reindex to expand the dates and ffill to forward-fill the NaNs:
def expand_dates(grp):
    start = grp.index.min()
    end = today
    index = pd.date_range(start, end, freq='D')
    return grp.reindex(index).ffill()
Use groupby/apply to call expand_dates once for each group and concatenate the results:
df = df.groupby('country')['gd'].apply(expand_dates)
Correction: My first answer forward-filled the entire DataFrame as the last step: df = df.ffill(). That is correct only if each country's first gd value is not NaN. If the starting row(s) for a certain country have NaN gd value(s), then forward-filling may contaminate those gd values with values from another country. Yikes. The more robust and correct method is to forward-fill once per group, as shown by piRSquared. Any performance gain from forward-filling once instead of once per smaller DataFrame would be minor, since the number of ffill calls is bounded by the number of countries (a pretty low number), and guarding against a potential bug is far more important than that limited performance gain.
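For illustration, here is a minimal sketch of that contamination, using hypothetical data in which the UK's first gd value is NaN (unlike the data in the question): a global ffill pulls the last US value into the UK rows, while a per-group ffill does not.
import numpy as np
import pandas as pd

# Hypothetical data: the UK's first gd value is NaN.
idx = pd.MultiIndex.from_tuples(
    [('US', '2013-01-01'), ('US', '2013-01-02'),
     ('UK', '2000-01-01'), ('UK', '2000-01-02')],
    names=['country', 'date'])
s = pd.Series([1.0, 2.0, np.nan, 3.0], index=idx, name='gd')

print(s.ffill())
# ('UK', '2000-01-01') becomes 2.0 -- a US value leaked into the UK group

print(s.groupby(level='country').ffill())
# ('UK', '2000-01-01') stays NaN -- per-group forward-fill is safe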
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['US', 'US', 'US', 'UK', 'UK', 'UK'],
                   'date': ['01-01-2014', '01-01-2015', '01-01-2013',
                            '01-01-2000', '02-01-2001', '01-01-2016'],
                   'gd': [2.0, 3.0, 0.4, 0.7, 0.5, 1.0]})
df['date'] = pd.to_datetime(df['date'])
today = pd.Timestamp('today')

def expand_dates(grp):
    start = grp.index.min()
    end = today
    index = pd.date_range(start, end, freq='D')
    return grp.reindex(index).ffill()

df = df.set_index('date')
df = df.groupby('country')['gd'].apply(expand_dates)
print(pd.concat([df.head(), df.tail()]))
yields
country
UK 2000-01-01 0.7
2000-01-02 0.7
2000-01-03 0.7
2000-01-04 0.7
2000-01-05 0.7
US 2017-07-14 3.0
2017-07-15 3.0
2017-07-16 3.0
2017-07-17 3.0
2017-07-18 3.0
Name: gd, dtype: float64
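If you want the flat country / date / gd columns from the question back, here is a minimal follow-up sketch; df here refers to the MultiIndexed Series produced by the groupby/apply above, whose second (date) level is unnamed.
# Name the unnamed date level, then move both index levels back into columns.
flat = df.rename_axis(['country', 'date']).reset_index()
print(flat.head())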
Upvotes: 3
Reputation: 294218
s = df.set_index(['country', 'date']).gd
today = pd.Timestamp.today()  # current timestamp, used as the common end date

def then2now(x):
    # Drop the outer 'country' level so the group keeps a plain DatetimeIndex
    x = x.xs(x.name)
    mn = x.index.min()
    # Reindex from the group's first date up to today and forward-fill the gaps
    return x.reindex(pd.date_range(mn, today, name='date')).ffill()

s.groupby(level='country').apply(then2now).reset_index()
country date gd
0 UK 2000-01-01 0.7
400 UK 2001-02-04 0.5
800 UK 2002-03-11 0.5
1200 UK 2003-04-15 0.5
1600 UK 2004-05-19 0.5
2000 UK 2005-06-23 0.5
2400 UK 2006-07-28 0.5
2800 UK 2007-09-01 0.5
3200 UK 2008-10-05 0.5
3600 UK 2009-11-09 0.5
4000 UK 2010-12-14 0.5
4400 UK 2012-01-18 0.5
4800 UK 2013-02-21 0.5
5200 UK 2014-03-28 0.5
5600 UK 2015-05-02 0.5
6000 UK 2016-06-05 1.0
6400 UK 2017-07-10 1.0
6800 US 2014-01-27 2.0
7200 US 2015-03-03 3.0
7600 US 2016-04-06 3.0
8000 US 2017-05-11 3.0
Upvotes: 3
Reputation: 210822
In [67]: today = pd.to_datetime(pd.datetime.now()).normalize()
In [68]: l = df.country.nunique()
In [72]: df.append(pd.DataFrame({'country':df.country.unique(), 'date':[today]*l, 'gd':[np.nan]*l})) \
...: .sort_values('date') \
...: .groupby('country') \
...: .resample('1D', on='date') \
...: .mean() \
...: .reset_index() \
...: .ffill()
...:
Out[72]:
country date gd
0 UK 2000-01-01 0.7
1 UK 2000-01-02 0.7
2 UK 2000-01-03 0.7
3 UK 2000-01-04 0.7
4 UK 2000-01-05 0.7
5 UK 2000-01-06 0.7
6 UK 2000-01-07 0.7
7 UK 2000-01-08 0.7
8 UK 2000-01-09 0.7
9 UK 2000-01-10 0.7
... ... ... ...
8059 US 2017-07-09 3.0
8060 US 2017-07-10 3.0
8061 US 2017-07-11 3.0
8062 US 2017-07-12 3.0
8063 US 2017-07-13 3.0
8064 US 2017-07-14 3.0
8065 US 2017-07-15 3.0
8066 US 2017-07-16 3.0
8067 US 2017-07-17 3.0
8068 US 2017-07-18 3.0
[8069 rows x 3 columns]
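For reference, a sketch of the same pipeline as a plain script, assuming df is the original three-column frame (with date already parsed to datetime) and np/pd are imported as in the first answer; since DataFrame.append was removed in pandas 2.0, the sketch uses pd.concat in its place.
today = pd.Timestamp.now().normalize()
n = df.country.nunique()

# One placeholder row per country dated today, so each country's
# resampled range reaches today even if its last observation is older.
anchors = pd.DataFrame({'country': df.country.unique(),
                        'date': [today] * n,
                        'gd': [np.nan] * n})

out = (pd.concat([df, anchors])        # DataFrame.append was removed in pandas 2.0
         .sort_values('date')
         .groupby('country')
         .resample('1D', on='date')    # daily grid within each country
         .mean()                       # gd is NaN on the newly created days
         .reset_index()
         .ffill())                     # forward-fill gd down the frame
print(out)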
Upvotes: 3