Reputation: 1327
I have the following DataFrame:
import pandas as pd

dates = pd.date_range('20150101', periods=4)
df = pd.DataFrame({'A': [5, 10, 3, 4]}, index=dates)
df.loc[:, 'B'] = 0
df.loc[:, 'C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3
print(df)
             A   B  C
2015-01-01   5  10  3
2015-01-02  10   0  0
2015-01-03   3   0  0
2015-01-04   4   0  0
I want to implement the following logic for the columns B and C:
B(k+1) = B(k) - A(k+1)
C(k+1) = B(k) + A(k+1)
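For example, with the data above, B(1) = B(0) - A(1) = 10 - 10 = 0 and C(1) = B(0) + A(1) = 10 + 10 = 20, which matches the second row of the output shown below.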
I can do this using the following code:
for i in range(1, df.shape[0]):
    df.iloc[i, 1] = df.iloc[i-1, 1] - df.iloc[i, 0]
    df.iloc[i, 2] = df.iloc[i-1, 1] + df.iloc[i, 0]
print(df)
This gives:
             A   B   C
2015-01-01   5  10   3
2015-01-02  10   0  20
2015-01-03   3  -3   3
2015-01-04   4  -7   1
Which is the answer I'm looking for. The problem is that when I apply this to a DataFrame with a large dataset, it runs slowly. Very slowly. Is there a better way of achieving this?
Upvotes: 10
Views: 8852
Reputation: 1187
You can extract the underlying arrays in numpy and get a much faster, simpler solution. Here we can get a 130x speedup.
import numpy as np
import pandas as pd

dates = pd.date_range('20150101', periods=10000)
df = pd.DataFrame({'A': np.random.rand(10000)}, index=dates)
df.loc[:, 'B'] = 0
df.loc[:, 'C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3
Old, slow, Pandas version:
%%time
for i in range(1, df.shape[0]):
    df.iloc[i, 1] = df.iloc[i-1, 1] - df.iloc[i, 0]
    df.iloc[i, 2] = df.iloc[i-1, 1] + df.iloc[i, 0]
CPU times: user 6.02 s, sys: 3.33 ms, total: 6.02 s
Wall time: 5.98 s
Better, fast, numpy version:
%%time
arr_a = df['A'].to_numpy()
arr_b = df['B'].to_numpy()
arr_c = df['C'].to_numpy()
for i in range(1, df.shape[0]):
    arr_b[i] = arr_b[i - 1] - arr_a[i]
    arr_c[i] = arr_b[i - 1] + arr_a[i]
CPU times: user 47.6 ms, sys: 23 µs, total: 47.6 ms
Wall time: 46 ms
The numpy arrays reference the original memory, so changing the numpy arrays modifies the dataframe as well.
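A quick way to confirm the arrays are views rather than copies (a minimal check; it assumes pandas' default behaviour without copy-on-write, where to_numpy() can return a view):

np.shares_memory(arr_b, df['B'].to_numpy())  # True when arr_b is a view into df's data
df.head()                                    # the loop's results are visible in the DataFrame itself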
Pandas indexing is ludicrously slow. They expect you to use vectorized operations, but for problems like this, it's difficult to get the three PhDs required to understand what set of groups, merges, and indexing Pandas would need. Pulling out the underlying arrays and doing the modifications directly is much simpler and faster.
Upvotes: 2
Reputation: 25189
A complete solution:
df1 = df[:1]                          # save the first row (the given B(0) and C(0))
df['B'] = df['B'].shift().cumsum()[1:] - df['A'][1:].cumsum()  # B(k) = B(0) - (A(1) + ... + A(k))
df[:1] = df1                          # restore the first row, which the line above left as NaN
df['C'] = df['B'].shift() + df['A']   # C(k) = B(k-1) + A(k)
df[:1] = df1                          # restore the first row again
df
             A   B   C
2015-01-01   5  10   3
2015-01-02  10   0  20
2015-01-03   3  -3   3
2015-01-04   4  -7   1
Upvotes: 2
Reputation: 52236
Recursive things like this can be hard to vectorize. numba usually handles them well - if you need to redistribute your code, cython may be a better choice, as it produces regular C extensions with no extra dependencies.
In [88]: import numba
In [89]: @numba.jit(nopython=True)
    ...: def logic(a, b, c):
    ...:     N = len(a)
    ...:     out = np.zeros((N, 2), dtype=np.int64)
    ...:     for i in range(N):
    ...:         if i == 0:
    ...:             out[i, 0] = b[i]
    ...:             out[i, 1] = c[i]
    ...:         else:
    ...:             out[i, 0] = out[i - 1, 0] - a[i]
    ...:             out[i, 1] = out[i - 1, 0] + a[i]
    ...:     return out
In [90]: logic(df.A.values, df.B.values, df.C.values)
Out[90]:
array([[10, 3],
[ 0, 20],
[-3, 3],
[-7, 1]], dtype=int64)
In [91]: df[['B','C']] = logic(df.A.values, df.B.values, df.C.values)
Edit: As shown in the other answers, this problem can actually be vectorized, and you should probably use one of those approaches instead.
Upvotes: 3
Reputation: 375425
A trick to vectorize is to rewrite everything as cumsums.
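To see why this works, unroll the recursion: B(k) = B(0) - (A(1) + ... + A(k)), and since B(k-1) = B(k) + A(k), C(k) = B(k-1) + A(k) = B(k) + 2*A(k). The stored B column is B(0) followed by zeros, so df["B"].cumsum() is just B(0) repeated down the column, and the shifted cumulative sum of A below supplies the (A(1) + ... + A(k)) term.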
In [11]: x = df["A"].shift(-1).cumsum().shift().fillna(0)
In [12]: x
Out[12]:
2015-01-01 0
2015-01-02 10
2015-01-03 13
2015-01-04 17
Name: A, dtype: float64
In [13]: df["B"].cumsum() - x
Out[13]:
2015-01-01 10
2015-01-02 0
2015-01-03 -3
2015-01-04 -7
dtype: float64
In [14]: df["B"].cumsum() - x + 2 * df["A"]
Out[14]:
2015-01-01 20
2015-01-02 20
2015-01-03 3
2015-01-04 1
dtype: float64
Note: The first value is a special case so you have to adjust that back to 3.
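Putting the pieces together (a minimal sketch of the final assembly; the only step beyond the cells above is restoring that special-cased first value of C):

b = df["B"].cumsum() - x
c = df["B"].cumsum() - x + 2 * df["A"]
c.iloc[0] = 3            # the first value is the special case noted above
df["B"], df["C"] = b, c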
Upvotes: 7
Reputation: 3058
Basically it is your answer without the for loop. Since each B value depends on the previous, already-updated B value, B has to be written as a cumulative sum rather than a plain shift; C only needs the previous B, so a shift of the updated column is enough:

df.loc[df.index[1:], 'B'] = df['B'].iloc[0] - df['A'].iloc[1:].cumsum()
df.loc[df.index[1:], 'C'] = df['B'].iloc[:-1].values + df['A'].iloc[1:].values

I don't know about the exact performance, but I guess it will be faster without the loop.
Upvotes: 1