adil blanco

Reputation: 355

How to generate missing lines in a dataframe

I'm trying to generate the missing rows in my dataframe based on the year column (integer). I tried several solutions that I found on Stack Overflow, but none of them worked. They fail with the error: ValueError: cannot reindex from a duplicate axis.

Add missing dates to pandas dataframe

Missing data, insert rows in Pandas and fill with NAN

import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
    'year':  [2010, 2013, 2014, 2015, 2016, 2010, 2012],
    'value':  [3000, 1000, 2000, 1200, 1300, 2000, 1500]}

df = pd.DataFrame(data)
df


What I'm looking for:

[image: expected output]

Upvotes: 3

Views: 338

Answers (6)

Quang Hoang

Reputation: 150735

IIUC, you want to expand the data to the full year range for each id:

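import numpy as np
import pandas as pd

# get the min and max year range
# change max_year if needed
min_year, max_year = df.year.min(), df.year.max()

# full range of years, used as the left side of the merge
years = pd.DataFrame({'year': np.arange(min_year, max_year + 1)})

# groupby: outer-merge each id's rows against the full year range
new_df = df.groupby('id').apply(lambda x: years.merge(x, how='outer')
                                       .drop('id', axis=1)
                               ).reset_index()

# years that were missing for an id get value 0
new_df['value'] = new_df['value'].fillna(0)

# if there are other columns, this forward-fills them within each id
# (it returns a new frame, so assign the result if you need it)
new_df.groupby('id').ffill()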

Output:

     id  level_1  year   value
0   100        0  2010  3000.0
1   100        1  2011     0.0
2   100        2  2012     0.0
3   100        3  2013  1000.0
4   100        4  2014  2000.0
5   100        5  2015  1200.0
6   100        6  2016  1300.0
7   200        0  2010  2000.0
8   200        1  2011     0.0
9   200        2  2012  1500.0
10  200        3  2013     0.0
11  200        4  2014     0.0
12  200        5  2015     0.0
13  200        6  2016     0.0

Upvotes: 3

jezrael

Reputation: 862591

The idea is to create a MultiIndex with DataFrame.set_index, then call DataFrame.reindex with a MultiIndex built by MultiIndex.from_product:

import numpy as np
import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
    'year':  [2010, 2013, 2014, 2015, 2016, 2010, 2012],
    'value':  [3000, 1000, 2000, 1200, 1300, 2000, 1500]}

df = pd.DataFrame(data)

# you can specify the minimal and maximal year as scalars
mux = pd.MultiIndex.from_product([df['id'].unique(),
                                  np.arange(df['year'].min(), 2019)],
                                  names=['id','year'])

df1 = df.set_index(['id','year']).reindex(mux, fill_value=0).reset_index()
print (df1)
     id  year  value
0   100  2010   3000
1   100  2011      0
2   100  2012      0
3   100  2013   1000
4   100  2014   2000
5   100  2015   1200
6   100  2016   1300
7   100  2017      0
8   100  2018      0
9   200  2010   2000
10  200  2011      0
11  200  2012   1500
12  200  2013      0
13  200  2014      0
14  200  2015      0
15  200  2016      0
16  200  2017      0
17  200  2018      0

# or you can derive the minimal and maximal year with min() and max()
mux = pd.MultiIndex.from_product([df['id'].unique(),
                                  np.arange(df['year'].min(), df['year'].max() + 1)], 
                                  names=['id','year'])

df2 = df.set_index(['id','year']).reindex(mux, fill_value=0).reset_index()
print (df2)
     id  year  value
0   100  2010   3000
1   100  2011      0
2   100  2012      0
3   100  2013   1000
4   100  2014   2000
5   100  2015   1200
6   100  2016   1300
7   200  2010   2000
8   200  2011      0
9   200  2012   1500
10  200  2013      0
11  200  2014      0
12  200  2015      0
13  200  2016      0
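An error like the one in the question is typically raised when the index being reindexed contains duplicate labels, here a repeated (id, year) pair. The sketch below is one possible way to handle that: it collapses the duplicates before reindexing. The sample frame df_dup is hypothetical, and summing the duplicated values is an assumption; use whatever aggregation fits your data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the pair (100, 2013) appears twice,
# so set_index(['id', 'year']) would produce a non-unique index
df_dup = pd.DataFrame({'id':    [100, 100, 100, 200],
                       'year':  [2010, 2013, 2013, 2012],
                       'value': [3000, 1000, 500, 1500]})

# Collapse duplicates first (summing is an assumption, not a requirement)
agg = df_dup.groupby(['id', 'year'], as_index=False)['value'].sum()

# Every id combined with every year in the overall range
mux = pd.MultiIndex.from_product(
    [agg['id'].unique(),
     np.arange(agg['year'].min(), agg['year'].max() + 1)],
    names=['id', 'year'])

# The index is unique now, so reindex can add the missing rows
filled = agg.set_index(['id', 'year']).reindex(mux, fill_value=0).reset_index()
print(filled)
```

A quick way to check whether your own data has this problem is df.duplicated(['id', 'year']).any().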

Upvotes: 4

Nev1111

Reputation: 1049

Use pd.merge_ordered method:

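import pandas as pd

df = pd.DataFrame({'id': (100, 100, 100, 100, 100, 200, 200),
                   'year': (2010, 2013, 2014, 2015, 2016, 2010, 2012),
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

# every year that should appear for each id
df_year = pd.DataFrame({'year': (2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020)})

# merge group by group on id; years an id does not have get NaN, then 0
df_merged = pd.merge_ordered(df, df_year, left_by='id').fillna(0)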

Upvotes: 0

anky

Reputation: 75080

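Another method using groupby and reindex():

# reindex each id's rows onto the full year range (2010 to 2018)
l=[g.set_index('year').reindex(range(2010,2019)) for i,g in df.groupby('id')]
final=pd.concat(l).reset_index()
# ffill restores id on the new rows (this relies on every id having a row for 2010);
# missing values become 0
final=final.assign(**{'id':final.id.ffill(),'value':final.value.fillna(0)})
print(final)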

    year     id   value
0   2010  100.0  3000.0
1   2011  100.0     0.0
2   2012  100.0     0.0
3   2013  100.0  1000.0
4   2014  100.0  2000.0
5   2015  100.0  1200.0
6   2016  100.0  1300.0
7   2017  100.0     0.0
8   2018  100.0     0.0
9   2010  200.0  2000.0
10  2011  200.0     0.0
11  2012  200.0  1500.0
12  2013  200.0     0.0
13  2014  200.0     0.0
14  2015  200.0     0.0
15  2016  200.0     0.0
16  2017  200.0     0.0
17  2018  200.0     0.0
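In the output above, id and value come back as floats because reindex introduces NaN, and the ffill on id depends on each group starting with an existing row. A minimal variant, assuming the same 2010 to 2018 range, sets id directly from the group key and casts value back to integers. The names years and frames are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id':    [100, 100, 100, 100, 100, 200, 200],
                   'year':  [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

years = range(2010, 2019)  # assumed target range, as in the answer above

# Reindex each group on the full year range and set id from the group key,
# so no forward fill across groups is needed
frames = [g.set_index('year')
           .reindex(years)
           .rename_axis('year')
           .assign(id=i)
          for i, g in df.groupby('id')]

final = pd.concat(frames).reset_index()

# Missing values become 0; cast back to int since no NaN is left
final['value'] = final['value'].fillna(0).astype(int)
final = final[['id', 'year', 'value']]
print(final)
```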

Upvotes: 3

Sebastien D

Reputation: 4482

Here is an approach which generates rows for years that are "missing" from the dataframe as a whole (not per id):

import pandas as pd

data = {'id': [100, 100, 100, 100, 100, 200, 200],
    'year':  [2010, 2013, 2014, 2015, 2016, 2010, 2012],
    'value':  [3000, 1000, 2000, 1200, 1300, 2000, 1500]}

df = pd.DataFrame(data)
# Pick the existing year extremes
min_year = df.year.min()
max_year = df.year.max()

# Look for years that no row has
missing_years = [x for x in range(min_year, max_year+1) if x not in df.year.values]

# Generate a dataframe with empty id and value for those years
df_missing = pd.DataFrame([[None, x, None] for x in missing_years], columns = df.columns)

# Append it to the original (DataFrame.append was removed in pandas 2.0)
pd.concat([df, df_missing])

Output

+----+-------+-------+-------+
|    |  id   | year  | value |
+----+-------+-------+-------+
| 0  | 100   | 2010  | 3000  |
| 1  | 100   | 2013  | 1000  |
| 2  | 100   | 2014  | 2000  |
| 3  | 100   | 2015  | 1200  |
| 4  | 100   | 2016  | 1300  |
| 5  | 200   | 2010  | 2000  |
| 6  | 200   | 2012  | 1500  |
| 0  | None  | 2011  | None  |
+----+-------+-------+-------+
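The output above only gains a row for 2011, because that is the only year missing from the whole frame. If the goal is to add the years each id lacks, one possible sketch is to run the same check per group. The names all_years, rows and result are illustrative, and filling value with 0 is an assumption.

```python
import pandas as pd

df = pd.DataFrame({'id':    [100, 100, 100, 100, 100, 200, 200],
                   'year':  [2010, 2013, 2014, 2015, 2016, 2010, 2012],
                   'value': [3000, 1000, 2000, 1200, 1300, 2000, 1500]})

# Overall year range, shared by every id
all_years = range(df['year'].min(), df['year'].max() + 1)

# For each id, collect the years in the range that the id does not have
rows = []
for i, g in df.groupby('id'):
    have = set(g['year'])
    rows.extend((i, y) for y in all_years if y not in have)

df_missing = pd.DataFrame(rows, columns=['id', 'year'])
df_missing['value'] = 0  # assumption: missing years should get value 0

# Combine with the original rows and restore id/year order
result = (pd.concat([df, df_missing], ignore_index=True)
            .sort_values(['id', 'year'])
            .reset_index(drop=True))
print(result)
```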

Upvotes: 2

Omer Tekbiyik

Reputation: 4744

import datetime
import pandas as pd

year = datetime.datetime.today().year  # current year

data = {'id': [100, 100, 100, 100, 100, 200, 200],
    'year':  [2010, 2013, 2014, 2015, 2016, 2010, 2012],
    'value':  [3000, 1000, 2000, 1200, 1300, 2000, 1500]}

df = pd.DataFrame(data)
min_year = df['year'].min()  # earliest year in the data

# years from the current year back down to min_year + 1
difference = year - min_year
years = list(range(year, year - difference, -1))

data['year'] = years

# The lists now have different lengths, which the plain constructor rejects;
# orient='index' pads the shorter ones with NaN instead
df = pd.DataFrame.from_dict(data, orient='index')
df = df.transpose()
print(df)

Upvotes: 2
