Erlinska
Erlinska

Reputation: 433

Replace 0 by the last value seen in the array (similar to the fillna method with NaN values) in pandas dataframes

I've been trying to create a new boolean array from two initial boolean arrays (df1 and df2) , following this rule:

-If the last True value was seen in df1, the value in df3 is True -If the last True value was seen in df2, the value in df3 is False

I found a way to do this by substracting the two arrays, then remplacing the 0 values by NaN values, using the .fillnamethod to fill the NaN values, and finally remplacing the -1 values by 0.

The code looks like this (with random arrays given as examples):

import pandas as pd import matplotlib.pyplot as plt import numpy as np

df1 = pd.Series((1,1,1,0,0,0,0,0,0,0,
                 0,0,0,0,1,0,0,1,0,0,
                 0,0,0,0,0,1,1,1,0,0,
                 0,0,0,0,0,0,0,0,0,0,
                 0,0,0,1,1,1,0,0,0,0))


df2 = pd.Series((0,0,0,0,0,0,1,1,0,0,
                 0,0,0,0,0,0,0,0,0,0,
                 1,0,0,0,0,0,0,0,0,0,
                 0,0,0,1,0,0,1,0,0,0,
                 0,0,0,0,0,0,0,0,0,0))

df3 = df1-df2
df3 = df3.replace(0,np.nan).fillna(method='ffill')
df3 = df3.replace(-1,0)

It does the trick and returns me the array I want, however it seems not optimized to remplace values several times like this, and working around thefillna method to do what I want.

Would anyone know a way to do this more 'properly', and if there is a method similar to the fillna one, but on other values than NaN?

Optimization is really important on this problem, as I'll go through this process a lot of time.

Upvotes: 2

Views: 119

Answers (1)

JohnE
JohnE

Reputation: 30444

This should be a bit faster (about 6x for 200,000 rows but check for yourself, of course). After import numpy as np:

arr = np.select( [df1==1,df2==1], [1,0], default=np.nan )
ser = pd.Series( arr ).ffill()

If you aren't familiar with np.select it's pretty similar to np.where but allows multiple conditions. You could also do a nested np.where but this is more readable.

Below is arr, and what allows you to do a simple ffill without needing to use replace twice is that you have nan's instead of zeroes.

array([  1.,   1.,   1.,  nan,  nan,  nan,   0.,   0.,  nan,  nan,  nan,
        nan,  nan,  nan,   1.,  nan,  nan,   1.,  nan,  nan,   0.,  nan,
        nan,  nan,  nan,   1.,   1.,   1.,  nan,  nan,  nan,  nan,  nan,
         0.,  nan,  nan,   0.,  nan,  nan,  nan,  nan,  nan,  nan,   1.,
         1.,   1.,  nan,  nan,  nan,  nan])

You might also get a little extra speed by storing df1 & df2 as numpy arrays rather than dataframes/series, but I don't think it would matter too much.

If speed is really a concern, you should try to do everything in numpy rather than pandas (when possible), but I'm not sure of a good numpy analog for pandas ffill.

Upvotes: 1

Related Questions