Vincent

Reputation: 8826

How to forward fill NaN with a constant sum based on the last diff in Pandas

Suppose I have a dataframe that looks like

internal_id time_id      membership
a           2017-01-01   10
a           2017-02-01   20
a           2017-03-01
a           2017-04-01
b           2017-01-01   10
b           2017-02-01   15
b           2017-03-01
b           2017-04-01
c           2017-01-01   10
c           2017-02-01   12
c           2017-03-01
c           2017-04-01

I would like to forward fill using the latest diff for each internal_id: for 'a' keep adding 10, for 'b' keep adding 5, and for 'c' keep adding 2. So my final df looks like:

internal_id time_id      membership
a           2017-01-01   10
a           2017-02-01   20
a           2017-03-01   30
a           2017-04-01   40
b           2017-01-01   10
b           2017-02-01   15
b           2017-03-01   20
b           2017-04-01   25
c           2017-01-01   10
c           2017-02-01   12
c           2017-03-01   14
c           2017-04-01   16

I tried pd.ffill, but that only seems to fill with a constant value. Is there a way I can fill using a constant sum instead?
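
For reference, the frame can be rebuilt like this (a minimal reconstruction; the blank membership cells are NaN):

import numpy as np
import pandas as pd

# reconstruction of the example above; blank membership cells become NaN
df = pd.DataFrame({
    'internal_id': list('aaaabbbbcccc'),
    'time_id': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01'] * 3,
    'membership': [10, 20, np.nan, np.nan,
                   10, 15, np.nan, np.nan,
                   10, 12, np.nan, np.nan],
})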

Upvotes: 3

Views: 551

Answers (3)

piRSquared

Reputation: 294338

from numba import njit
import pandas as pd, numpy as np

@njit
def dfill(g, v, n):
    # per-group state, indexed by the factorized group label
    d = np.arange(n) * np.nan       # last observed diff for each group
    l = np.arange(n) * np.nan       # last observed value for each group
    c = np.arange(n) * np.nan       # current running fill value for each group
    r = np.arange(g.size) * np.nan  # result array
    for i in range(g.size):
        x = g[i]  # group label of this row
        y = v[i]  # membership value of this row
        if np.isnan(y):
            c[x] += d[x]            # missing: extend by the last diff
            r[i] = c[x]
        else:
            d[x] = y - l[x]         # observed: update the diff and reset the state
            r[i] = l[x] = c[x] = y
    return r

# factorize maps each internal_id to an integer label 0..n-1 so the jitted
# loop can index plain arrays; u.size is the number of distinct groups
g, u = pd.factorize(df.internal_id.values)
df.assign(membership=dfill(g, df.membership.values, u.size))

   internal_id     time_id  membership
0            a  2017-01-01        10.0
1            a  2017-02-01        20.0
2            a  2017-03-01        30.0
3            a  2017-04-01        40.0
4            b  2017-01-01        10.0
5            b  2017-02-01        15.0
6            b  2017-03-01        20.0
7            b  2017-04-01        25.0
8            c  2017-01-01        10.0
9            c  2017-02-01        12.0
10           c  2017-03-01        14.0
11           c  2017-04-01        16.0

Timing

Relative times, each row divided by its fastest entry (so 1.0 marks the fastest). The row index is the number of copies of the example frame used in the test; the benchmark code is below.

      pir         john        scott
10    1.0    61.777804    69.453241
30    1.0   173.136378   183.454086
100   1.0   518.193661   589.781918
300   1.0  1336.147050  1497.396670
1000  1.0  2502.225163  2621.045714

(plot of the relative timings omitted)

from timeit import timeit

@njit
def dfill(g, v, n):
    d = np.arange(n) * np.nan
    l = np.arange(n) * np.nan
    c = np.arange(n) * np.nan
    r = np.arange(g.size) * np.nan
    for i in range(g.size):
        x = g[i]
        y = v[i]
        if np.isnan(y):
            c[x] += d[x]
            r[i] = c[x]
        else:
            d[x] = y - l[x]
            r[i] = l[x] = c[x] = y
    return r

def pir(d):
    g, u = pd.factorize(d.internal_id.values)
    return d.assign(membership=dfill(g, d.membership.values, u.size))

def john(d):
    return d.assign(membership=d.groupby('internal_id').membership.transform(lambda x: x.ffill() + x.diff().shift().ffill().cumsum().fillna(0)))

def scott(d):
    d2 = (d.groupby('internal_id')
            .transform(lambda x: x.fillna(x.diff().max())
                                  .where(x.index != x.index.min()).cumsum()))
    return d.combine_first(d2)

results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000],
    columns='pir john scott'.split(),
    dtype=float
)

k = len(df)
for i in results.index:
    # i copies of the example frame, each copy given its own internal_id suffix
    # so the groups stay distinct
    r = np.arange(1, i + 1).repeat(k).astype(str)
    d = pd.concat([df] * i, ignore_index=True)
    d.internal_id += r
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)

# divide each row by its minimum -> relative times (1.0 = fastest)
(lambda r: r.div(r.min(1), 0))(results)

Upvotes: 2

Zero

Reputation: 76927

Here's one approach

In [38]: df.assign(membership=df.groupby('internal_id').membership
           .transform(lambda x: x.ffill() + x.diff().shift().ffill().cumsum().fillna(0)))
Out[38]:
   internal_id     time_id  membership
0            a  2017-01-01        10.0
1            a  2017-02-01        20.0
2            a  2017-03-01        30.0
3            a  2017-04-01        40.0
4            b  2017-01-01        10.0
5            b  2017-02-01        15.0
6            b  2017-03-01        20.0
7            b  2017-04-01        25.0
8            c  2017-01-01        10.0
9            c  2017-02-01        12.0
10           c  2017-03-01        14.0
11           c  2017-04-01        16.0
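
The transform packs several steps into one expression; here is a sketch of the intermediate pieces, worked out for group 'a' of the example frame above:

# intermediate pieces of the transform for group 'a' ([10, 20, NaN, NaN])
a = df.loc[df.internal_id == 'a', 'membership']

filled = a.ffill()                                    # 10, 20, 20, 20
step = a.diff().shift().ffill().cumsum().fillna(0)    # 0, 0, 10, 20
filled + step                                         # 10, 20, 30, 40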

Upvotes: 2

Scott Boston

Reputation: 153460

Here is one way to do it. I am not proud of this solution, but I will look for ways to optimize it:

d2 = (df.groupby('internal_id')
        .transform(lambda x: x.fillna(x.diff().max())
                              .where(x.index != x.index.min()).cumsum()))
df.combine_first(d2)

Output:

   internal_id  membership     time_id
0            a        10.0  2017-01-01
1            a        20.0  2017-02-01
2            a        30.0  2017-03-01
3            a        40.0  2017-04-01
4            b        10.0  2017-01-01
5            b        15.0  2017-02-01
6            b        20.0  2017-03-01
7            b        25.0  2017-04-01
8            c        10.0  2017-01-01
9            c        12.0  2017-02-01
10           c        14.0  2017-03-01
11           c        16.0  2017-04-01
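
To unpack what the transform is doing, here is a sketch restricted to the membership column (the answer above applies the same lambda to every column):

# sketch of the intermediate frame d2 for the membership column only
d2 = (df.groupby('internal_id')['membership']
        .transform(lambda x: x.fillna(x.diff().max())           # NaNs -> the group's largest diff (10, 5, 2)
                              .where(x.index != x.index.min())  # blank the group's first row
                              .cumsum()))                       # rebuild the series as a running sum
# d2 for group 'a' is NaN, 20, 30, 40; combine_first keeps the original
# 10 and 20 and takes 30 and 40 where membership was NaN.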

Upvotes: 1
