Reputation: 355
I'm trying to generate the missing rows in my DataFrame based on the year column (integers). I tried several solutions that I found on Stack Overflow, but none of them worked; they raise the error: ValueError: cannot reindex from a duplicate axis
The questions I tried:
Add missing dates to pandas dataframe
Missing data, insert rows in Pandas and fill with NAN
import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
        'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
        'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)
df
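For context, here is a minimal sketch of one common way this error arises. It is an assumption about what went wrong, since the failing code is not shown: the frame is indexed by year alone, and because the same year occurs for several ids, that index contains duplicates and reindex refuses to work on it. The exact wording of the message varies between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 200],
                   'year': [2010, 2013, 2010],
                   'value': [3000, 1000, 2000]})

# 'year' alone is not unique (2010 appears for both ids),
# so using it as the index produces duplicate labels
by_year = df.set_index('year')

try:
    by_year.reindex(range(2010, 2014))
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```

Indexing by both id and year, or reindexing each id separately, avoids the duplicate labels as long as each (id, year) pair occurs only once.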
What I'm looking for
Upvotes: 3
Views: 338
Reputation: 150735
IIUC you want to interpolate the data to the maximal year range by ID:
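import numpy as np

# get the min and max year range
# change max_year if needed
min_year, max_year = df.year.min(), df.year.max()
# helper frame with every year in the range, used for the merge
new_df = pd.DataFrame({'year': np.arange(min_year, max_year + 1)})
# groupby: outer-merge each id's rows onto the full year range
new_df = (df.groupby('id')
            .apply(lambda x: new_df.merge(x, how='outer').drop('id', axis=1))
            .reset_index())
# assign the result instead of calling fillna(inplace=True) on a column selection
new_df['value'] = new_df['value'].fillna(0)
# forward-fill any other columns within each id (returns a new frame; assign it to keep the result)
new_df.groupby('id').ffill()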
Output:
id level_1 year value
0 100 0 2010 3000.0
1 100 1 2011 0.0
2 100 2 2012 0.0
3 100 3 2013 1000.0
4 100 4 2014 2000.0
5 100 5 2015 1200.0
6 100 6 2016 1300.0
7 200 0 2010 2000.0
8 200 1 2011 0.0
9 200 2 2012 1500.0
10 200 3 2013 0.0
11 200 4 2014 0.0
12 200 5 2015 0.0
13 200 6 2016 0.0
Upvotes: 3
Reputation: 862591
The idea is to turn id and year into a MultiIndex with DataFrame.set_index, then call DataFrame.reindex with a full MultiIndex created by MultiIndex.from_product:
import numpy as np
import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
        'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
        'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)
# you can specify the minimal and maximal year as scalars (here the range ends at 2018)
mux = pd.MultiIndex.from_product([df['id'].unique(),
                                  np.arange(df['year'].min(), 2019)],
                                 names=['id', 'year'])
df1 = df.set_index(['id', 'year']).reindex(mux, fill_value=0).reset_index()
print(df1)
id year value
0 100 2010 3000
1 100 2011 0
2 100 2012 0
3 100 2013 1000
4 100 2014 2000
5 100 2015 1200
6 100 2016 1300
7 100 2017 0
8 100 2018 0
9 200 2010 2000
10 200 2011 0
11 200 2012 1500
12 200 2013 0
13 200 2014 0
14 200 2015 0
15 200 2016 0
16 200 2017 0
17 200 2018 0
# you can also take the minimal and maximal year from the data with min and max
mux = pd.MultiIndex.from_product([df['id'].unique(),
                                  np.arange(df['year'].min(), df['year'].max() + 1)],
                                 names=['id', 'year'])
df2 = df.set_index(['id', 'year']).reindex(mux, fill_value=0).reset_index()
print(df2)
id year value
0 100 2010 3000
1 100 2011 0
2 100 2012 0
3 100 2013 1000
4 100 2014 2000
5 100 2015 1200
6 100 2016 1300
7 200 2010 2000
8 200 2011 0
9 200 2012 1500
10 200 2013 0
11 200 2014 0
12 200 2015 0
13 200 2016 0
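Note that reindexing on the (id, year) MultiIndex still fails with the error from the question if the same (id, year) pair occurs more than once. Below is a minimal sketch of one way around that, assuming it is acceptable to sum the values of duplicated pairs first; the sample data and the choice of sum are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# hypothetical data in which (100, 2010) occurs twice
df = pd.DataFrame({'id': [100, 100, 100, 200],
                   'year': [2010, 2010, 2012, 2011],
                   'value': [3000, 500, 1000, 2000]})

# collapse duplicated (id, year) pairs so the index becomes unique
agg = df.groupby(['id', 'year'])['value'].sum()

# full grid of every id combined with every year in the observed range
mux = pd.MultiIndex.from_product(
    [df['id'].unique(),
     np.arange(df['year'].min(), df['year'].max() + 1)],
    names=['id', 'year'])

out = agg.reindex(mux, fill_value=0).reset_index()
print(out)
```

If summing is not appropriate, drop_duplicates or another aggregation can take its place; the point is only that the index must be unique before reindex is called.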
Upvotes: 4
Reputation: 1049
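Use the pd.merge_ordered function:
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})
# every year that should appear for each id (2010 through 2020)
df_year = pd.DataFrame({'year': list(range(2010, 2021))})
# left_by='id' merges each id group separately against the full list of years
df_merged = pd.merge_ordered(df, df_year, left_by='id').fillna(0)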
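Since no output is shown for this answer, here is a small sketch that reruns the same call and checks that every id ends up with every year. Building the year list from the observed minimum and maximum is an assumption made for illustration; the answer itself uses a fixed list through 2020.

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

# assumed bounds: the smallest and largest year present in the data
df_year = pd.DataFrame({'year': list(range(df['year'].min(), df['year'].max() + 1))})

# each id group is merged on 'year' against the full list; gaps become NaN, then 0
df_merged = pd.merge_ordered(df, df_year, left_by='id').fillna(0)

# number of distinct years per id; each should equal len(df_year)
years_per_id = df_merged.groupby('id')['year'].nunique()
print(years_per_id)
```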
Upvotes: 0
Reputation: 75080
Another method, using groupby and reindex():
l=[g.set_index('year').reindex(range(2010,2019)) for i,g in df.groupby('id')]
final=pd.concat(l).reset_index()
final=final.assign(**{'id':final.id.ffill(),'value':final.value.fillna(0)})
print(final)
year id value
0 2010 100.0 3000.0
1 2011 100.0 0.0
2 2012 100.0 0.0
3 2013 100.0 1000.0
4 2014 100.0 2000.0
5 2015 100.0 1200.0
6 2016 100.0 1300.0
7 2017 100.0 0.0
8 2018 100.0 0.0
9 2010 200.0 2000.0
10 2011 200.0 0.0
11 2012 200.0 1500.0
12 2013 200.0 0.0
13 2014 200.0 0.0
14 2015 200.0 0.0
15 2016 200.0 0.0
16 2017 200.0 0.0
17 2018 200.0 0.0
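In the output above, id and value come back as floats because reindex introduces NaN before the fill. A possible variation, sketched here under the assumption that only the value column needs filling, passes fill_value=0 to reindex and writes the id back with assign, so both columns stay integers. The names pieces and year_range are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

year_range = range(2010, 2019)

# reindex only the value column per id, filling gaps with 0,
# then restore the id as a constant column for that group
pieces = [g.set_index('year')[['value']]
           .reindex(year_range, fill_value=0)
           .rename_axis('year')
           .assign(id=i)
          for i, g in df.groupby('id')]

final = pd.concat(pieces).reset_index()[['id', 'year', 'value']]
print(final)
```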
Upvotes: 3
Reputation: 4482
Here is an approach which generates rows for "missing" years:
data = {'id': [100, 100, 100, 100, 100, 200, 200],
'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)
#Pick the existing year extremes
min_year = df.year.min()
max_year = df.year.max()
#Look for missing years
missing_years = [x for x in range(min_year, max_year+1) if x not in df.year.values]
#Generate a dataframe
df_missing = pd.DataFrame([[None, x, None] for x in missing_years], columns = df.columns)
#Append it to original
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([df, df_missing])
Output
+----+-------+-------+-------+
| | id | year | value |
+----+-------+-------+-------+
| 0 | 100 | 2010 | 3000 |
| 1 | 100 | 2013 | 1000 |
| 2 | 100 | 2014 | 2000 |
| 3 | 100 | 2015 | 1200 |
| 4 | 100 | 2016 | 1300 |
| 5 | 200 | 2010 | 2000 |
| 6 | 200 | 2012 | 1500 |
| 0 | None | 2011 | None |
+----+-------+-------+-------+
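As the output shows, this checks for missing years across the whole frame, so only 2011 is added and the new row has no id. A per-id variant could look like the sketch below; filling value with 0 instead of None is an assumption taken from the other answers, and the names all_years and missing are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [100, 100, 100, 100, 100, 200, 200],
                   'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

all_years = range(df['year'].min(), df['year'].max() + 1)

# collect one row per (id, year) combination that is absent for that id
missing = []
for i, g in df.groupby('id'):
    present = set(g['year'])
    missing.extend({'id': i, 'year': y, 'value': 0}
                   for y in all_years if y not in present)

# append the generated rows and restore a sorted, clean index
result = (pd.concat([df, pd.DataFrame(missing)], ignore_index=True)
            .sort_values(['id', 'year'], ignore_index=True))
print(result)
```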
Upvotes: 2
Reputation: 4744
import datetime
import pandas as pd

current_year = datetime.datetime.today().year  # the current year
data = {'id': [100, 100, 100, 100, 100, 200, 200],
        'year': [2010, 2013, 2014, 2015, 2016, 2010, 2012],
        'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]}
df = pd.DataFrame(data)
min_year = df['year'].min()  # earliest year in the data (named min_year to avoid shadowing the built-in min)
difference = current_year - min_year
years = range(current_year, current_year - difference, -1)
data['year'] = list(years)
# The columns now differ in length, which would raise 'arrays must all be same length';
# building the frame with orient='index' pads the shorter ones with NaN instead.
df = pd.DataFrame.from_dict(data, orient='index')
df = df.transpose()  # transpose() returns a new frame, so keep the result
print(df)
Upvotes: 2