Reputation: 33
I am iterating over a pandas DataFrame and finding it extremely slow. I understand that in pandas you try to vectorize everything, but in this case I specifically need to iterate (or, if it is possible to vectorize, I'm unclear how to do it).
The logic is simple: you have two columns, "A" and "B", and a result column, "signal". If A equals 1, then you set signal to 1. If B equals 1, then you set signal to 0. Otherwise, signal is whatever it was in the previous row. In other words, column A is an "on" signal, column B is an "off" signal, and "signal" represents the state.
Here is my code:
def signals(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    data['signal'] = 0
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            data['signal'].iloc[i] = 1
        elif data['B'].iloc[i] == 1:
            data['signal'].iloc[i] = 0
        else:
            data['signal'].iloc[i] = data['signal'].iloc[i-1]
    return data
Example input/output:
indata = pd.DataFrame(index = range(0,10))
indata['A'] = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
indata['B'] = [1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
signals(indata)
Output:
   A  B  signal
0  0  1       0
1  1  0       1
2  0  0       1
3  0  0       1
4  0  1       0
5  0  0       0
6  1  0       1
7  0  0       1
8  0  1       0
9  0  1       0
This simple logic takes my computer 46 seconds to run on a dataframe of 2000 rows with randomly generated data.
Upvotes: 2
Views: 1006
Reputation: 33
The simplest answer to my problem was to not write to the dataframe while iterating through it. I created a NumPy array of zeros, ran my iterative logic on the array, and then wrote the array to the column in my dataframe.
def signals3(indata):
    numrows = len(indata)
    data = pd.DataFrame(index=range(0, numrows))
    data['A'] = indata['A']
    data['B'] = indata['B']
    out_signal = np.zeros(numrows)
    for i in range(1, numrows):
        if data['A'].iloc[i] == 1:
            out_signal[i] = 1
        elif data['B'].iloc[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i-1]
    data['signal'] = out_signal
    return data
On a dataframe of 2000 rows of random data, this takes only 43 milliseconds as opposed to 46 seconds (~1,000x faster).
I also tried a variant where I assigned the dataframe columns A and B to series, and then iterated through the series. This was a bit faster (27 milliseconds). But it appears most of the slowness is in writing to a dataframe.
Both coldspeed's and djk's answers were faster than my solution (about 4.5 ms), but in practice I'll probably just iterate through series, even though that is not optimal.
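The series variant mentioned above is not shown in the answer, so here is a minimal sketch of what it might look like. The function name `signals_arrays` is hypothetical, and pulling the columns out with `to_numpy()` is an assumption about how the columns were extracted; the loop itself mirrors `signals3`.

```python
import numpy as np
import pandas as pd


def signals_arrays(indata):
    """Hypothetical variant of signals3 that loops over plain arrays."""
    # Pull the columns out once so the loop never touches the DataFrame.
    a = indata['A'].to_numpy()
    b = indata['B'].to_numpy()
    out_signal = np.zeros(len(indata), dtype=int)
    # Start at 1 to match the original loop: row 0 always stays 0.
    for i in range(1, len(indata)):
        if a[i] == 1:
            out_signal[i] = 1
        elif b[i] == 1:
            out_signal[i] = 0
        else:
            out_signal[i] = out_signal[i - 1]
    data = indata[['A', 'B']].copy()
    data['signal'] = out_signal
    return data


indata = pd.DataFrame({
    'A': [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    'B': [1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
})
result = signals_arrays(indata)
print(result['signal'].tolist())
# [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
```

The only write to the dataframe happens once, after the loop, which is the point made above about where the slowness comes from.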
Upvotes: 0
Reputation: 9274
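You don't need to iterate at all; you can use Boolean indexing and a forward fill:
# B is the "off" signal; set it first so that A wins if both are 1 (as in the question's if/elif)
indata.loc[indata.B == 1, 'signal'] = 0
# A is the "on" signal
indata.loc[indata.A == 1, 'signal'] = 1
# forward fill the NaN values; rows before the first signal default to 0
indata['signal'] = indata['signal'].ffill().fillna(0).astype(int)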
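As a self-contained check of the mask-and-forward-fill idea, here is a sketch that can be run against the question's sample data. The function name `signals_ffill` is hypothetical, and two behaviors are assumptions taken from the question's loop rather than from this answer: A takes priority when both columns are 1, and rows before the first signal default to 0.

```python
import numpy as np
import pandas as pd


def signals_ffill(indata):
    """Hypothetical helper: on/off state via masking plus forward fill."""
    data = indata[['A', 'B']].copy()
    # 1 where A fires, 0 where only B fires, NaN where neither fires.
    # np.select checks the conditions in order, so A wins if both are 1.
    state = np.select(
        [data['A'] == 1, data['B'] == 1],
        [1.0, 0.0],
        default=np.nan,
    )
    # Carry the last known state forward; rows before any signal become 0.
    filled = pd.Series(state, index=data.index).ffill().fillna(0)
    data['signal'] = filled.astype(int)
    return data


indata = pd.DataFrame({
    'A': [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    'B': [1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
})
result = signals_ffill(indata)
print(result['signal'].tolist())
# [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
```

One difference from the loop in the question: the loop starts at row 1, so row 0 is always 0 there, whereas this sketch honors an "on" signal in row 0.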
Upvotes: 1
Reputation: 402844
df['signal'] = df.A.groupby((df.A != df.B).cumsum()).transform('first')
df
   A  B  signal
0  0  1       0
1  1  0       1
2  0  0       1
3  0  0       1
4  0  1       0
5  0  0       0
6  1  0       1
7  0  0       1
8  0  1       0
9  0  1       0
The logic here involves dividing your series into groups based on the inequality between A and B: a new group starts at every row where A differs from B, and every row in a group takes the group's first A value.
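To make the grouping concrete, here is a small sketch on the question's sample data that prints the intermediate group ids. The variable name `group_id` is only for illustration, and `transform('first')` is used here to broadcast each group's first A value, which is an assumption about the intended behavior.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    'B': [1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
})

# A new group id starts at every row where A and B differ.
group_id = (df['A'] != df['B']).cumsum()
print(group_id.tolist())
# [1, 2, 2, 2, 3, 3, 4, 4, 5, 6]

# Broadcast each group's first A value to every row of that group.
df['signal'] = df['A'].groupby(group_id).transform('first')
print(df['signal'].tolist())
# [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
```

Note that a row where A and B are both 1 does not start a new group, so this matches the loop in the question only if the two signals never fire on the same row.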
Upvotes: 2