Reputation: 433
I've been trying to create a new boolean array from two initial boolean arrays (df1 and df2) , following this rule:
-If the last True value was seen in df1, the value in df3 is True -If the last True value was seen in df2, the value in df3 is False
I found a way to do this by substracting the two arrays, then remplacing the 0 values by NaN values, using the .fillna
method to fill the NaN values, and finally remplacing the -1 values by 0.
The code looks like this (with random arrays given as examples):
import pandas as pd import matplotlib.pyplot as plt import numpy as np
df1 = pd.Series((1,1,1,0,0,0,0,0,0,0,
0,0,0,0,1,0,0,1,0,0,
0,0,0,0,0,1,1,1,0,0,
0,0,0,0,0,0,0,0,0,0,
0,0,0,1,1,1,0,0,0,0))
df2 = pd.Series((0,0,0,0,0,0,1,1,0,0,
0,0,0,0,0,0,0,0,0,0,
1,0,0,0,0,0,0,0,0,0,
0,0,0,1,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0))
df3 = df1-df2
df3 = df3.replace(0,np.nan).fillna(method='ffill')
df3 = df3.replace(-1,0)
It does the trick and returns me the array I want, however it seems not optimized to remplace values several times like this, and working around thefillna
method to do what I want.
Would anyone know a way to do this more 'properly', and if there is a method similar to the fillna
one, but on other values than NaN?
Optimization is really important on this problem, as I'll go through this process a lot of time.
Upvotes: 2
Views: 119
Reputation: 30444
This should be a bit faster (about 6x for 200,000 rows but check for yourself, of course). After import numpy as np
:
arr = np.select( [df1==1,df2==1], [1,0], default=np.nan )
ser = pd.Series( arr ).ffill()
If you aren't familiar with np.select
it's pretty similar to np.where
but allows multiple conditions. You could also do a nested np.where
but this is more readable.
Below is arr, and what allows you to do a simple ffill
without needing to use replace
twice is that you have nan
's instead of zeroes.
array([ 1., 1., 1., nan, nan, nan, 0., 0., nan, nan, nan,
nan, nan, nan, 1., nan, nan, 1., nan, nan, 0., nan,
nan, nan, nan, 1., 1., 1., nan, nan, nan, nan, nan,
0., nan, nan, 0., nan, nan, nan, nan, nan, nan, 1.,
1., 1., nan, nan, nan, nan])
You might also get a little extra speed by storing df1 & df2 as numpy arrays rather than dataframes/series, but I don't think it would matter too much.
If speed is really a concern, you should try to do everything in numpy rather than pandas (when possible), but I'm not sure of a good numpy analog for pandas ffill
.
Upvotes: 1