Anthony W

Reputation: 1327

Speeding up an iloc solution within a pandas dataframe

I have the following DataFrame:

import pandas as pd

dates = pd.date_range('20150101', periods=4)
df = pd.DataFrame({'A': [5, 10, 3, 4]}, index=dates)

df.loc[:, 'B'] = 0
df.loc[:, 'C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3

print(df)

Out[69]:

             A   B  C
2015-01-01   5  10  3
2015-01-02  10   0  0
2015-01-03   3   0  0
2015-01-04   4   0  0

I want to implement the following logic for the columns B and C: for every row after the first, B[i] = B[i-1] - A[i] and C[i] = B[i-1] + A[i], with the first row's values of B and C given.

I can do this using the following code:

for i in range(1, df.shape[0]):
    df.iloc[i, 1] = df.iloc[i - 1, 1] - df.iloc[i, 0]
    df.iloc[i, 2] = df.iloc[i - 1, 1] + df.iloc[i, 0]
print(df)

This gives:

             A   B   C
2015-01-01   5  10   3
2015-01-02  10   0  20
2015-01-03   3  -3   3
2015-01-04   4  -7   1

Which is the answer I'm looking for. The problem is that when I apply this to a DataFrame with a large dataset, it runs slow. Very slow. Is there a better way of achieving this?

Upvotes: 10

Views: 8852

Answers (5)

Stu Gla

Reputation: 1187

You can extract the underlying arrays in numpy and get a much faster, simpler solution. Here we can get a 130x speedup.

import numpy as np
import pandas as pd

dates = pd.date_range('20150101', periods=10000)
df = pd.DataFrame({'A': np.random.rand(10000)}, index=dates)

df.loc[:, 'B'] = 0
df.loc[:, 'C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3

Old, slow, Pandas version:

%%time
for i in range(1, df.shape[0]):
    df.iloc[i, 1] = df.iloc[i - 1, 1] - df.iloc[i, 0]
    df.iloc[i, 2] = df.iloc[i - 1, 1] + df.iloc[i, 0]


CPU times: user 6.02 s, sys: 3.33 ms, total: 6.02 s
Wall time: 5.98 s

Better, fast, numpy version:

%%time
arr_a = df['A'].to_numpy()
arr_b = df['B'].to_numpy()
arr_c = df['C'].to_numpy()
for i in range(1, df.shape[0]):
    arr_b[i] = arr_b[i - 1] - arr_a[i]
    arr_c[i] = arr_b[i - 1] + arr_a[i]


CPU times: user 47.6 ms, sys: 23 µs, total: 47.6 ms
Wall time: 46 ms

The numpy arrays reference the original memory, so changing the numpy arrays modifies the dataframe as well.

Pandas indexing is ludicrously slow. They expect you to use vectorized operations, but for problems like this it's difficult to get the 3 PhDs required to work out which set of groups, merges, and indexing operations Pandas would need. Pulling out the underlying arrays and doing the modifications directly is much simpler and faster.
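One caveat: whether `.to_numpy()` returns a view of the frame's memory can vary with the pandas version and dtype layout (newer releases with copy-on-write may hand back a copy, and `np.shares_memory` is the way to check). A defensive sketch of the same idea that assumes nothing about views is to compute on standalone arrays and assign the results back explicitly:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20150101', periods=4)
df = pd.DataFrame({'A': [5, 10, 3, 4]}, index=dates)
df['B'] = 0
df['C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3

# Work on standalone copies, then write the results back --
# no reliance on to_numpy() returning a view.
arr_a = df['A'].to_numpy()
arr_b = df['B'].to_numpy().copy()
arr_c = df['C'].to_numpy().copy()
for i in range(1, len(arr_a)):
    arr_b[i] = arr_b[i - 1] - arr_a[i]
    arr_c[i] = arr_b[i - 1] + arr_a[i]
df['B'] = arr_b
df['C'] = arr_c
print(df)
```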

Upvotes: 2

Sergey Bushmanov

Reputation: 25189

A complete solution:

df1 = df[:1]
df['B'] = df['B'].shift().cumsum()[1:] - df['A'][1:].cumsum()
df[:1] = df1
df['C'] = df['B'].shift() + df['A']
df[:1] = df1
df

            A   B   C
2015-01-01  5   10  3
2015-01-02  10  0   20
2015-01-03  3   -3  3
2015-01-04  4   -7  1
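Run end-to-end on the question's data, the same steps look like this (a self-contained sketch; the saved first row `df1` supplies the seed values that the `shift` drops):

```python
import pandas as pd

dates = pd.date_range('20150101', periods=4)
df = pd.DataFrame({'A': [5, 10, 3, 4]}, index=dates)
df['B'] = 0
df['C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3

# Save the seed row, fill B and C via shifted cumulative sums,
# then restore row 0 after each assignment.
df1 = df[:1]
df['B'] = df['B'].shift().cumsum()[1:] - df['A'][1:].cumsum()
df[:1] = df1
df['C'] = df['B'].shift() + df['A']
df[:1] = df1
print(df)
```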

Upvotes: 2

chrisb

Reputation: 52236

Recursive things like this can be hard to vectorize. Numba usually handles them well; if you need to redistribute your code, Cython may be a better choice, as it produces regular C extensions with no extra dependencies.

In [88]: import numba

In [89]: @numba.jit(nopython=True)
    ...: def logic(a, b, c):
    ...:     N = len(a)
    ...:     out = np.zeros((N, 2), dtype=np.int64)
    ...:     for i in range(N):
    ...:         if i == 0:
    ...:             out[i, 0] = b[i]
    ...:             out[i, 1] = c[i]
    ...:         else:
    ...:             out[i, 0] = out[i-1,0] - a[i]
    ...:             out[i, 1] = out[i-1,0] + a[i]
    ...:     return out

In [90]: logic(df.A.values, df.B.values, df.C.values)
Out[90]: 
array([[10,  3],
       [ 0, 20],
       [-3,  3],
       [-7,  1]], dtype=int64)

In [91]: df[['B','C']] = logic(df.A.values, df.B.values, df.C.values)

Edit: As shown in the other answers, this problem can actually be vectorized, and you should probably use that instead.

Upvotes: 3

Andy Hayden

Reputation: 375425

A trick to vectorize is to rewrite everything as cumsums.

In [11]: x = df["A"].shift(-1).cumsum().shift().fillna(0)

In [12]: x
Out[12]:
2015-01-01     0
2015-01-02    10
2015-01-03    13
2015-01-04    17
Name: A, dtype: float64

In [13]: df["B"].cumsum() - x
Out[13]:
2015-01-01    10
2015-01-02     0
2015-01-03    -3
2015-01-04    -7
dtype: float64

In [14]: df["B"].cumsum() - x + 2 * df["A"]
Out[14]:
2015-01-01    20
2015-01-02    20
2015-01-03     3
2015-01-04     1
dtype: float64

Note: The first value is a special case so you have to adjust that back to 3.
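Putting the pieces together (with the first-row adjustment folded in) might look like this sketch. Since B here is the already-updated column, C follows from C[i] = B[i-1] + A[i] = B[i] + 2*A[i]:

```python
import pandas as pd

dates = pd.date_range('20150101', periods=4)
df = pd.DataFrame({'A': [5, 10, 3, 4]}, index=dates)
df['B'] = 0
df['C'] = 0
df.iloc[0, 1] = 10
df.iloc[0, 2] = 3

x = df['A'].shift(-1).cumsum().shift().fillna(0)
c0 = df['C'].iloc[0]            # seed value for the special-cased first row
df['B'] = df['B'].cumsum() - x  # B[i] = B[0] - sum(A[1..i]), written as cumsums
df['C'] = df['B'] + 2 * df['A'] # C[i] = B[i-1] + A[i] = B[i] + 2*A[i]
df.iloc[0, 2] = c0              # first row keeps its seed value
print(df)
```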

Upvotes: 7

Elad Joseph

Reputation: 3058

Basically it is just your answer without the for loop:

df['B'].iloc[1:] = df['B'].iloc[:-1].values - df['A'].iloc[1:].values
df['C'].iloc[1:] = df['B'].iloc[:-1].values + df['A'].iloc[1:].values

I don't know about the performance, but I guess it will be faster without the loop.

Upvotes: 1
