Reputation: 8826
Suppose I have a dataframe that looks like
internal_id time_id membership
a 2017-01-01 10
a 2017-02-01 20
a 2017-03-01
a 2017-04-01
b 2017-01-01 10
b 2017-02-01 15
b 2017-03-01
b 2017-04-01
c 2017-01-01 10
c 2017-02-01 12
c 2017-03-01
c 2017-04-01
I would like to forward fill using the latest diff for each internal_id
. So for example, for internal_id = 'a'
I would like to forward fill adding 10 each time. For 'b' I would like to forward fill using 5 each time. For 'c' I would like to forward fill 2 each time. So my final df looks like:
internal_id time_id membership
a 2017-01-01 10
a 2017-02-01 20
a 2017-03-01 30
a 2017-04-01 40
b 2017-01-01 10
b 2017-02-01 15
b 2017-03-01 20
b 2017-04-01 25
c 2017-01-01 10
c 2017-02-01 12
c 2017-03-01 14
c 2017-04-01 16
I tried pd.ffill
but that seemed to only be able to fill using constant values. Is there a way I can fill using with a constant sum?
Upvotes: 3
Views: 551
Reputation: 294338
from numba import njit
import pandas as pd, numpy as np
@njit
def dfill(g, v, n):
d = np.arange(n) * np.nan
l = np.arange(n) * np.nan
c = np.arange(n) * np.nan
r = np.arange(g.size) * np.nan
for i in range(g.size):
x = g[i]
y = v[i]
if np.isnan(y):
c[x] += d[x]
r[i] = c[x]
else:
d[x] = y - l[x]
r[i] = l[x] = c[x] = y
return r
g, u = pd.factorize(df.internal_id.values)
df.assign(membership=dfill(g, df.membership.values, u.size))
internal_id time_id membership
0 a 2017-01-01 10.0
1 a 2017-02-01 20.0
2 a 2017-03-01 30.0
3 a 2017-04-01 40.0
4 b 2017-01-01 10.0
5 b 2017-02-01 15.0
6 b 2017-03-01 20.0
7 b 2017-04-01 25.0
8 c 2017-01-01 10.0
9 c 2017-02-01 12.0
10 c 2017-03-01 14.0
11 c 2017-04-01 16.0
Timing
Code Below
pir john scott
10 1.0 61.777804 69.453241
30 1.0 173.136378 183.454086
100 1.0 518.193661 589.781918
300 1.0 1336.147050 1497.396670
1000 1.0 2502.225163 2621.045714
@njit
def dfill(g, v, n):
d = np.arange(n) * np.nan
l = np.arange(n) * np.nan
c = np.arange(n) * np.nan
r = np.arange(g.size) * np.nan
for i in range(g.size):
x = g[i]
y = v[i]
if np.isnan(y):
c[x] += d[x]
r[i] = c[x]
else:
d[x] = y - l[x]
r[i] = l[x] = c[x] = y
return r
def pir(d):
g, u = pd.factorize(d.internal_id.values)
return d.assign(membership=dfill(g, d.membership.values, u.size))
def john(d):
return d.assign(membership=d.groupby('internal_id').membership.transform(lambda x: x.ffill() + x.diff().shift().ffill().cumsum().fillna(0)))
def scott(d):
d2 = (d.groupby('internal_id')
.transform(lambda x: x.fillna(x.diff().max())
.where(x.index != x.index.min()).cumsum()))
return d.combine_first(d2)
results = pd.DataFrame(
index=[10, 30, 100, 300, 1000],
columns='pir john scott'.split(),
dtype=float
)
k = len(df)
for i in results.index:
r = np.arange(1, i + 1).repeat(k).astype(str)
d = pd.concat([df] * i, ignore_index=True)
d.internal_id += r
for j in results.columns:
stmt = '{}(d)'.format(j)
setp = 'from __main__ import d, {}'.format(j)
results.at[i, j] = timeit(stmt, setp, number=10)
(lambda r: r.div(r.min(1), 0))(results)
Upvotes: 2
Reputation: 76927
Here's one approach
In [38]: df.assign(membership=df.groupby('internal_id').membership
.transform(lambda x: x.ffill() + x.diff().shift().ffill().cumsum().fillna(0)))
Out[38]:
internal_id time_id membership
0 a 2017-01-01 10.0
1 a 2017-02-01 20.0
2 a 2017-03-01 30.0
3 a 2017-04-01 40.0
4 b 2017-01-01 10.0
5 b 2017-02-01 15.0
6 b 2017-03-01 20.0
7 b 2017-04-01 25.0
8 c 2017-01-01 10.0
9 c 2017-02-01 12.0
10 c 2017-03-01 14.0
11 c 2017-04-01 16.0
Upvotes: 2
Reputation: 153460
Here is one way to do it, I am not proud of this solution, but I will look for ways to optimize:
d2 = (df.groupby('internal_id')
.transform(lambda x: x.fillna(x.diff().max())
.where(x.index != x.index.min()).cumsum()))
df.combine_first(d2)
Output:
internal_id membership time_id
0 a 10.0 2017-01-01
1 a 20.0 2017-02-01
2 a 30.0 2017-03-01
3 a 40.0 2017-04-01
4 b 10.0 2017-01-01
5 b 15.0 2017-02-01
6 b 20.0 2017-03-01
7 b 25.0 2017-04-01
8 c 10.0 2017-01-01
9 c 12.0 2017-02-01
10 c 14.0 2017-03-01
11 c 16.0 2017-04-01
Upvotes: 1