dholstius
dholstius

Reputation: 1007

Forward-fill dates within MultiIndexed DataFrame

I have a PANDAs DataFrame with a MultiIndex, where one of the levels represents a year:

import pandas as pd
df = pd.DataFrame(dict(A = ['foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                       B = ['white', 'black', 'white', 'white', 'black', 'black'],
                       year = [1990, 1992, 1990, 1992, 1991, 1992],
                       value = [3.14, 1.20, 4.56, 6.79, 0.01, 0.02]))

df = df.set_index(['A', 'B', 'year'])

I would like to forward-fill values, but only for the intervening years in each group (defined by the interaction of A and B). Here is the input:

                value
A   B     year       
foo white 1990   3.14
    black 1992   1.20
bar white 1990   4.56
          1992   6.79
    black 1991   0.01
          1992   0.02

And here is the desired output, with one additional row:

                value
A   B     year       
foo white 1990   3.14
    black 1992   1.20
bar white 1990   4.56
          1991   4.56  <-- new forward-filled value
          1992   6.79
    black 1991   0.01
          1992   0.02

How can I accomplish this concisely and efficiently? I've tried using combinations of groupby and apply, but I'm new to PANDAS and keep throwing Exceptions.

Here's an example of how I'm naively approaching the problem:

def ffill_years(df):
    df.reset_index(['A', 'B'])  # drop all but 'year'
    year_range = range(df['year'].min(), df['year'].max())
    df.reindex(pd.Series(years)).fillna("ffill")
    return df

df.groupby(level=['A', 'B']).apply(ffill_years)

Of course this doesn't work. Any and all tips appreciated!

Upvotes: 3

Views: 474

Answers (1)

chrisb
chrisb

Reputation: 52236

You were pretty close - a couple small changes:

  1. reset_index doesn't operate in place
  2. Can't reference index by name, need to use .index
  3. Need a +1 on your range to include the endpoint
  4. reindex is also not in-place
  5. First parameter to fillna is a fill value, use keyword method

See below:

def ffill_years(df):
    df = df.reset_index(['A','B'])  # drop all but 'year'
    year_range = range(df.index.min(), df.index.max() + 1)

    df = df.reindex(pd.Series(year_range)).fillna(method='ffill')
    return df

Results in

In [209]: df.groupby(level=['A','B']).apply(ffill_years)
Out[209]: 
                  A      B  value
A   B     year                   
bar black 1991  bar  black   0.01
          1992  bar  black   0.02
    white 1990  bar  white   4.56
          1991  bar  white   4.56
          1992  bar  white   6.79
foo black 1992  foo  black   1.20
    white 1990  foo  white   3.14

Upvotes: 3

Related Questions