Reputation: 571
I have a data frame that looks like:
Country Year Value
USA 1991 22
USA 1992 3
USA 1993 10
China 1991 1
China 1993 15
Argentina 1991 6
Argentina 1992 4
I need a function that could find the missing year for each country and add a row wit a NaN value to the data frame.
Country Year Value
USA 1991 22
USA 1992 3
USA 1993 10
China 1991 1
China 1992 NaN
China 1993 15
Argentina 1991 6
Argentina 1992 4
Argentina 1993 NaN
I need also to create a dataframe with values based only on the years when I have the values for all countries.
Country Year Value
USA 1991 22
China 1991 1
Argentina 1991 6
Upvotes: 1
Views: 60
Reputation: 862611
Use DataFrame.set_index
with MultiIndex.from_product
for DataFrame.reindex
:
df = df.set_index(['Country','Year'])
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux).reset_index()
print (df)
Country Year Value
0 Argentina 1991 6.0
1 Argentina 1992 4.0
2 Argentina 1993 NaN
3 China 1991 1.0
4 China 1992 NaN
5 China 1993 15.0
6 USA 1991 22.0
7 USA 1992 3.0
8 USA 1993 10.0
And for groups with no missing values:
vals = df1.loc[df1['Value'].isna(), 'Country'].unique()
df2 = df1[~df1['Country'].isin(vals)]
print (df2)
Country Year Value
6 USA 1991 22.0
7 USA 1992 3.0
8 USA 1993 10.0
Alternative is use DataFrame.unstack
with DataFrame.stack
:
s = df.set_index(['Country','Year']).unstack()
df1 = s.stack(dropna=False).reset_index()
print (df1)
Country Year Value
0 Argentina 1991 6.0
1 Argentina 1992 4.0
2 Argentina 1993 NaN
3 China 1991 1.0
4 China 1992 NaN
5 China 1993 15.0
6 USA 1991 22.0
7 USA 1992 3.0
8 USA 1993 10.0
For all values per columns use DataFrame.dropna
:
df2 = s.dropna(axis=1).stack().reset_index()
print (df2)
Country Year Value
0 Argentina 1991 6.0
1 China 1991 1.0
2 USA 1991 22.0
EDIT:
If get:
ValueError: cannot handle a non-unique multi-index!
it means there are not unique combinations of Country
and Year
column:
print (df)
Country Year Value
0 USA 1991 22 <-duplicate USA, 1991
1 USA 1991 3 <-duplicate USA, 1991
2 USA 1993 10
3 China 1991 1
4 China 1993 15
5 Argentina 1991 6
6 Argentina 1992 4
solution is change set_index
by groupby
with some aggregate function like mean
, sum
for unique combinations:
df = df.groupby(['Country','Year']).mean()
mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(mux).reset_index()
print (df)
Country Year Value
0 Argentina 1991 6.0
1 Argentina 1992 4.0
2 Argentina 1993 NaN
3 China 1991 1.0
4 China 1992 NaN
5 China 1993 15.0
6 USA 1991 12.5
7 USA 1992 NaN
8 USA 1993 10.0
Upvotes: 2