therion
therion

Reputation: 445

How to impute default values to Pandas dataframe columns?

Following is the df.head() of a DataSet with Date set as index.

               Article_ID Country_Code Sold_Units
     Date           
2017-01-01        3576        AT         1
2017-01-02        1579        FR         1
2017-01-02         332        FI         2
2017-01-03        3576        AT         1
2017-01-03         332        SE         1

The Country Code has 4 values 'AT, FR, FI, SE'. I want to append these 4 country_codes to every single date if they are not present on those dates and impute their values to 0 in Article_ID and Sold_Units.

The Example output should basically look like this:

               Article_ID Country_Code Sold_Units
     Date           
2017-01-01        3576        AT         1
2017-01-01          0         FR         0    # FR FI SE added with 0s.
2017-01-01          0         FI         0
2017-01-01          0         SE         0
2017-01-02          0         AT         0    # AT, SE added
2017-01-02        1579        FR         1
2017-01-02         332        FI         2
2017-01-02          0         SE         0
2017-01-03        3576        AT         1
2017-01-03          0         FR         0    # FR, FI added
2017-01-03          0         FI         0
2017-01-03         332        SE         1

How can I add such default values for every country code?

Upvotes: 1

Views: 117

Answers (2)

Henry Yik
Henry Yik

Reputation: 22493

Another solution using pivot_table:

import pandas as pd

df = pd.DataFrame({'Date': {0: '2017-01-01', 1: '2017-01-02', 2: '2017-01-02', 3: '2017-01-03', 4: '2017-01-03'},
                   'Article_ID': {0: 3576, 1: 1579, 2: 332, 3: 3576, 4: 332},
                   'Country_Code': {0: 'AT', 1: 'FR', 2: 'FI', 3: 'AT', 4: 'SE'},
                   'Sold_Units': {0: 1, 1: 1, 2: 2, 3: 1, 4: 1}})

df.set_index("Date",inplace=True)

result = (df.pivot_table(index="Date", columns="Country_Code", values=["Article_ID","Sold_Units"],aggfunc="mean")
         .fillna("0").stack().reset_index(level=1))

print (result)

#
           Country_Code Article_ID Sold_Units
Date                                         
2017-01-01           AT       3576          1
2017-01-01           FI          0          0
2017-01-01           FR          0          0
2017-01-01           SE          0          0
2017-01-02           AT          0          0
2017-01-02           FI        332          2
2017-01-02           FR       1579          1
2017-01-02           SE          0          0
2017-01-03           AT       3576          1
2017-01-03           FI          0          0
2017-01-03           FR          0          0
2017-01-03           SE        332          1

Upvotes: 0

jezrael
jezrael

Reputation: 862521

Solutions for unique DatetimeIndex with Country_Code combinations:

Create MultiIndex by added Country_Code to DatetimeIndex with all combinations of unique values of datetimes with codes with DataFrame.reindex:

df = (df.set_index('Country_Code', append=True)
        .reindex(pd.MultiIndex.from_product([df.index.unique(), df['Country_Code'].unique()], 
                                            names=['Date','Country_Code']), fill_value=0)
        .reset_index(level=1))
print (df)
           Country_Code  Article_ID  Sold_Units
Date                                           
2017-01-01           AT        3576           1
2017-01-01           FR           0           0
2017-01-01           FI           0           0
2017-01-01           SE           0           0
2017-01-02           AT           0           0
2017-01-02           FR        1579           1
2017-01-02           FI         332           2
2017-01-02           SE           0           0
2017-01-03           AT        3576           1
2017-01-03           FR           0           0
2017-01-03           FI           0           0
2017-01-03           SE         332           1

Or use DataFrame.unstack with DataFrame.stack:

df = (df.set_index('Country_Code', append=True)
        .unstack(fill_value=0)
        .stack()
        .reset_index(level=1))

print (df)
           Country_Code  Article_ID  Sold_Units
Date                                           
2017-01-01           AT        3576           1
2017-01-01           FI           0           0
2017-01-01           FR           0           0
2017-01-01           SE           0           0
2017-01-02           AT           0           0
2017-01-02           FI         332           2
2017-01-02           FR        1579           1
2017-01-02           SE           0           0
2017-01-03           AT        3576           1
2017-01-03           FI           0           0
2017-01-03           FR           0           0
2017-01-03           SE         332           1

Solution for multiple values per datetimes with country codes:

Error means data are like:

print (df)
            Article_ID Country_Code  Sold_Units
Date                                           
2017-01-01        3576           AT           1
2017-01-02        1579           FI           1 <-FI
2017-01-02         332           FI           2 <-FI
2017-01-03        3576           AT           1
2017-01-03         332           SE           1

df = (df.groupby(['Date','Country_Code'])
        .sum()
        .unstack(fill_value=0)
        .stack()
        .reset_index(level=1))
print (df)
           Country_Code  Article_ID  Sold_Units
Date                                           
2017-01-01           AT        3576           1
2017-01-01           FI           0           0
2017-01-01           SE           0           0
2017-01-02           AT           0           0
2017-01-02           FI        1911           3
2017-01-02           SE           0           0
2017-01-03           AT        3576           1
2017-01-03           FI           0           0
2017-01-03           SE         332           1

Upvotes: 1

Related Questions