Reputation: 656
Below is a portion of mydataframe
which has many missing values.
A B
S a b c d e a b c d e
date
2020-10-15 1.0 2.0 NaN NaN NaN 10.0 11.0 NaN NaN NaN
2020-10-16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-17 NaN NaN NaN 4.0 NaN NaN NaN NaN 13.0 NaN
2020-10-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-20 4.0 6.0 4.0 1.0 9.0 10.0 2.0 13.0 4.0 13.0
I would like to replace the NANs
in each column using a specific backward fill condition
.
For example, in column (A,a) missing values appear for dates 16th, 17th, 18th and 19th. The next value is '4' against 20th. I want this value (the next non missing value in the column) to be distributed among all these dates including 20th at a progressively increasing value of 10%. That is column (A,a) gets values of .655, .720,.793,.872 & .96 approximately for the dates 16th, 17th, 18th, 19th & 20th. This shall be the approach for all columns for all missing values across rows.
I tried using bfill() function but unable to fathom how to incorporate the required formula as an option.
I have checked the link Pandas: filling missing values in time series forward using a formula and a few other links on stackoverflow
. This is somewhat similar, but in my case the the number of NANs in a given column are variable in nature and span multiple rows. Compare columns (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adopt the solution to my problem.
Appreciate any inputs.
Upvotes: 3
Views: 649
Reputation: 120391
Initial data:
>>> df
A B
a b c d e a b c d e
date
2020-10-15 1.0 2.0 NaN NaN NaN 10.0 11.0 NaN NaN NaN
2020-10-16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-17 NaN NaN NaN 4.0 NaN NaN NaN NaN 13.0 NaN
2020-10-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-20 4.0 6.0 4.0 1.0 9.0 10.0 2.0 13.0 4.0 13.0
Define your geometric sequence:
def geomseq(seq):
q = 1.1
n = len(seq)
S = seq.max()
Uo = S * (1-q) / (1-q**n)
Un = [Uo * q**i for i in range(0, n)]
return Un
TL;DR
>>> df.unstack().groupby(df.unstack().sort_index(ascending=False).notna().cumsum().sort_index()).transform(geomseq).unstack(level=[0, 1])
A B
a b c d e a b c d e
date
2020-10-15 1.000000 2.000000 0.518430 1.208459 1.166466 10.000000 11.000000 1.684896 3.927492 1.684896
2020-10-16 0.655190 0.982785 0.570272 1.329305 1.283113 1.637975 0.327595 1.853386 4.320242 1.853386
2020-10-17 0.720709 1.081063 0.627300 1.462236 1.411424 1.801772 0.360354 2.038724 4.752266 2.038724
2020-10-18 0.792780 1.189170 0.690030 0.302115 1.552567 1.981950 0.396390 2.242597 1.208459 2.242597
2020-10-19 0.872058 1.308087 0.759033 0.332326 1.707823 2.180144 0.436029 2.466856 1.329305 2.466856
2020-10-20 0.959264 1.438895 0.834936 0.365559 1.878606 2.398159 0.479632 2.713542 1.462236 2.713542
Details
Convert your dataframe
to series
:
>>> sr = df.unstack()
>>> sr.head(10)
date
A a 2020-10-15 1.0
2020-10-16 NaN # <= group X (final value: .655)
2020-10-17 NaN # <= group X (final value: .720)
2020-10-18 NaN # <= group X (final value: .793)
2020-10-19 NaN # <= group X (final value: .872)
2020-10-20 4.0 # <= group X (final value: .960)
b 2020-10-15 2.0
2020-10-16 NaN
2020-10-17 NaN
2020-10-18 NaN
dtype: float64
Now you can build groups:
>>> groups = sr.sort_index(ascending=False).notna().cumsum().sort_index()
>>> groups.head(10)
date
A a 2020-10-15 16
2020-10-16 15 # <= group X15
2020-10-17 15 # <= group X15
2020-10-18 15 # <= group X15
2020-10-19 15 # <= group X15
2020-10-20 15 # <= group X15
b 2020-10-15 14
2020-10-16 13
2020-10-17 13
2020-10-18 13
dtype: int64
Apply your geometric progression:
>>> sr = sr.groupby(groups).transform(geomseq)
>>> sr.head(10)
date
A a 2020-10-15 1.000000
2020-10-16 0.655190 # <= group X15
2020-10-17 0.720709 # <= group X15
2020-10-18 0.792780 # <= group X15
2020-10-19 0.872058 # <= group X15
2020-10-20 0.959264 # <= group X15
b 2020-10-15 2.000000
2020-10-16 0.982785
2020-10-17 1.081063
2020-10-18 1.189170
dtype: float64
And finally, reshape series
according to your initial dataframe
:
>>> df = sr.unstack(level=[0, 1])
>>> df
A B
a b c d e a b c d e
date
2020-10-15 1.000000 2.000000 0.518430 1.208459 1.166466 10.000000 11.000000 1.684896 3.927492 1.684896
2020-10-16 0.655190 0.982785 0.570272 1.329305 1.283113 1.637975 0.327595 1.853386 4.320242 1.853386
2020-10-17 0.720709 1.081063 0.627300 1.462236 1.411424 1.801772 0.360354 2.038724 4.752266 2.038724
2020-10-18 0.792780 1.189170 0.690030 0.302115 1.552567 1.981950 0.396390 2.242597 1.208459 2.242597
2020-10-19 0.872058 1.308087 0.759033 0.332326 1.707823 2.180144 0.436029 2.466856 1.329305 2.466856
2020-10-20 0.959264 1.438895 0.834936 0.365559 1.878606 2.398159 0.479632 2.713542 1.462236 2.713542
Upvotes: 1
Reputation: 26201
Here is a completely vectorized way to do this. It is very efficient and fast: 130 ms on a 1000 x 1000 matrix. This is a good opportunity to expose some interesting techniques using numpy
.
First, let's dig a bit into the requirements, specifically what exactly the value for each cell needs to be.
The example given is [nan, nan, nan, nan, 4.0]
--> [.66, .72, .79, .87, .96]
, which is explained to be a "progressively increasing value of 10%" (in such a way that the total is the "value to spread": 4.0
).
This is a geometric series with rate r = 1 + 0.1
: [r^1, r^2, r^3, ...]
and then normalized to sum to 1. For example:
r = 1.1
a = 4.0
n = 5
q = np.cumprod(np.repeat(r, n))
a * q / q.sum()
# array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])
We'd like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum()
in closed form. It is a well-established quantity and is:
To generalize, we need 3 quantities to calculate the value of each cell:
a
: value to distributei
: index of run (0 .. n-1)n
: run lengthv = a * r**i * (r - 1) / (r**n - 1)
.To illustrate with the first column in the OP's example, where the input is: [1, nan, nan, nan, nan, 4]
, we would like:
a = [1, 4, 4, 4, 4, 4]
i = [0, 0, 1, 2, 3, 4]
n = [1, 5, 5, 5, 5, 5]
v
would be (rounded at 2 decimals): [1. , 0.66, 0.72, 0.79, 0.87, 0.96]
.Now comes the part where we go about getting these three quantities as numpy arrays.
a
is the easiest and is simply df.bfill().values
. But for i
and n
, we do have to do a little bit of work, starting by assigning the values to a numpy array:
z = df.values
nrows, ncols = z.shape
For i
, we start with the cumulative count of NaN
s, with reset when values are not NaN
. This is strongly inspired by this SO answer for "Cumulative counts in NumPy without iteration". But we do it for a 2D array, and we also want to add a first row of 0, and discard the last row to satisfy exactly our needs:
def rcount(z):
na = np.isnan(z)
without_reset = na.cumsum(axis=0)
reset_at = ~na
overcount = np.maximum.accumulate(without_reset * reset_at)
result = without_reset - overcount
return result
i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
For n
, we need to do some dancing on our own, using first principles of numpy (I'll break down the steps if I have time):
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
So, putting it all together:
def spread_bfill(df, r=1.1):
z = df.values
nrows, ncols = z.shape
a = df.bfill().values
i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
v = a * r**i * (r - 1) / (r**n - 1)
return pd.DataFrame(v, columns=df.columns, index=df.index)
On your example data, we get:
>>> spread_bfill(df).round(2) # round(2) for printing purposes
A B
a b c d e a b c d e
S
2020-10-15 1.00 2.00 0.52 1.21 1.17 10.00 11.00 1.68 3.93 1.68
2020-10-16 0.66 0.98 0.57 1.33 1.28 1.64 0.33 1.85 4.32 1.85
2020-10-17 0.72 1.08 0.63 1.46 1.41 1.80 0.36 2.04 4.75 2.04
2020-10-18 0.79 1.19 0.69 0.30 1.55 1.98 0.40 2.24 1.21 2.24
2020-10-19 0.87 1.31 0.76 0.33 1.71 2.18 0.44 2.47 1.33 2.47
2020-10-20 0.96 1.44 0.83 0.37 1.88 2.40 0.48 2.71 1.46 2.71
For inspection, let's look at each of the 3 quantities in that example:
>>> a
[[ 1 2 4 4 9 10 11 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]]
>>> i
[[0 0 0 0 0 0 0 0 0 0]
[0 0 1 1 1 0 0 1 1 1]
[1 1 2 2 2 1 1 2 2 2]
[2 2 3 0 3 2 2 3 0 3]
[3 3 4 1 4 3 3 4 1 4]
[4 4 5 2 5 4 4 5 2 5]]
>>> n
[[1 1 6 3 6 1 1 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]]
And here is a final example, to illustrate what happens if a column ends with 1 or several NaN
s (they remain NaN
):
np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
>>> df
0 1 2 3 4 5
0 NaN NaN NaN NaN NaN 0.0
1 NaN NaN 9.0 NaN 8.0 NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN 8.0 4.0 NaN NaN NaN
4 NaN NaN NaN 6.0 9.0 NaN
5 NaN NaN 2.0 NaN 7.0 8.0
Then:
>>> spread_bfill(df).round(2) # round(2) for printing
0 1 2 3 4 5
0 NaN 1.72 4.29 0.98 3.81 0.00
1 NaN 1.90 4.71 1.08 4.19 1.31
2 NaN 2.09 1.90 1.19 2.72 1.44
3 NaN 2.29 2.10 1.31 2.99 1.59
4 NaN NaN 0.95 1.44 3.29 1.74
5 NaN NaN 1.05 NaN 7.00 1.92
Speed
a = np.random.randint(0, 10, (1000, 1000)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
%timeit spread_bfill(df)
# 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 3