GJ14

Reputation: 69

How to vectorise this for loop in Python?

My data has four columns, A-D, containing integers. I am adding a new column E whose first value equals the first value in column D. Each subsequent value in E is the corresponding value in column D if the previous value in E is negative; otherwise it is the corresponding value in column C. In other words, E[0] = D[0] and, for i > 0, E[i] = D[i] if E[i-1] < 0 else C[i].

import pandas as pd
import numpy as np

data = pd.read_excel('/Users/xxxx/Documents/PY Notebooks/Data/yyyy.xlsx')
data1 = data.copy()
data1['E'] = np.nan
data1.at[0, 'E'] = data1['D'][0]
l = len(data1)
for i in range(l - 1):
    if data1['E'][i] < 0:
        data1.at[i + 1, 'E'] = data1['D'][i + 1]
    else:
        data1.at[i + 1, 'E'] = data1['C'][i + 1]
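For illustration, here is a tiny made-up example of the rule (not my actual data):

small = pd.DataFrame({'C': [10, -5, 3, 4], 'D': [-1, 2, -3, 7]})
# E[0] = D[0] = -1
# E[1] = D[1] =  2   (since E[0] < 0)
# E[2] = C[2] =  3   (since E[1] >= 0)
# E[3] = C[3] =  4   (since E[2] >= 0)
# expected E: [-1, 2, 3, 4]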

Upvotes: 4

Views: 229

Answers (2)

Bill Huang

Reputation: 4648

TL;DR: go to the benchmark code and use Method 1.

Short Answer

No. Vectorization is not possible.

Long Answer

Theorem: For this task, the output of a given row cannot be determined from any backward rolling window that is shorter than the full history up to that row.

Thus, there is no way for this output logic to be processed in a vectorized way. (See this answer for an idea of how vectorization is performed in CPUs.) The output can only be computed sequentially from the beginning of the dataframe.

Proof: Consider a target row of a dataframe df. Assume there is a backward rolling window of size n smaller than the partial length, so that a value of df["E"] exists just before the window. Denote this previous value by state.

Consider a special case: df["C"] == -1 and df["D"] == 1 within the window.

  • Case 1 (state < 0): the output within the window is [1, -1, 1, -1, ...], making the last element (-1)^(n-1)
  • Case 2 (state >= 0): the output is [-1, 1, -1, 1, ...], making the last element (-1)^n

Therefore, the output df["E"] of the target row can depend on a state variable outside the window. QED.
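A quick numerical check of the two cases (a throwaway sketch; C == -1 and D == 1 inside the window, as in the special case above):

def window_output(state, n):
    # recurrence inside the window: next = D (= 1) if previous < 0 else C (= -1)
    out = []
    for _ in range(n):
        state = 1 if state < 0 else -1
        out.append(state)
    return out

print(window_output(state=-1, n=5))  # [1, -1, 1, -1, 1]   last = (-1)**(5-1)
print(window_output(state=+1, n=5))  # [-1, 1, -1, 1, -1]  last = (-1)**5

The last element flips with the sign of state, so no window of size n < partial length can determine it.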

Useful Answer

Although vectorization is impossible, that does not mean significant acceleration cannot be achieved. A simple yet very efficient approach is to use a numba-compiled generator to perform the sequential generation. It only requires rewriting your logic as a generator function and adding two lines:

import numba

@numba.njit
def my_generator_func():
    ....

Of course, you may have to install numba first. If this is not possible, then using a plain generator without numba optimization is also fine.

Benchmark

The benchmark is performed on an i5-8250U (4C8T) laptop with 16GB RAM running 64-bit Debian 10. The Python version is 3.7.9 and pandas is 1.1.3. n = 10^7 (10 million) records are generated for benchmarking.

Result:

1. numba-njit: 2.48s
2. plain generator (no numba): 5.13s
3. original: 271.15s

A >100x speedup can be achieved over the original code.

Code

from datetime import datetime
import pandas as pd
import numpy as np

n = 10000000  # a large number of rows
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
#print(df.head())

# ========== Method 1. generator + numba njit ==========
ti = datetime.now()

import numba

@numba.njit
def gen(plus: np.ndarray, minus: np.ndarray):
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen(df["C"].values, df["D"].values)]

tf = datetime.now()
print(f"1. numba-njit: {(tf-ti).total_seconds():.2f}s")  # 1. numba-njit: 0.47s

# ========== Method 2. Generator without numba ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

def gen_plain(plus: np.ndarray, minus: np.ndarray):
    l = len(plus)
    assert len(minus) == l
    # first
    state = minus[0]
    yield state
    # second to last
    for i in range(l-1):
        state = minus[i+1] if state < 0 else plus[i+1]
        yield state

df["E"] = [i for i in gen_plain(df["C"].values, df["D"].values)]

tf = datetime.now()
print(f"2. plain generator (no numba): {(tf-ti).total_seconds():.2f}s")  #

# ========== Method 3. Direct iteration ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()

# code provided by the OP
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
l = len(df)
for i in range(l - 1):
    if df['E'][i] < 0:
        df.at[i + 1, 'E'] = df['D'][i + 1]
    else:
        df.at[i + 1, 'E'] = df['C'][i + 1]

tf = datetime.now()
print(f"3. original: {(tf-ti).total_seconds():.2f}s") # 2. 26.61s

Upvotes: 1

mlang

Reputation: 762

I don't think you can vectorize this operation, since each row depends on the result computed for the previous row. That said, there is still quite some room for optimization. Let's first time your original implementation on some random data.

import numpy as np
import pandas as pd
import time

size = 10000000
data = np.random.randint(-2, 10, size=size)
data = data.reshape([size // 4, 4])  # 2.5 million rows, 4 columns (A-D)

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
df['E'] = np.nan
df.at[0, 'E'] = df['D'][0]
for i in range(len(df) - 1):
    if df['E'][i] < 0:
        df.at[i + 1, 'E'] = df['D'][i + 1]
    else:
        df.at[i + 1, 'E'] = df['C'][i + 1]

print(f"Operation on pd df took {time.time() - time_start} seconds.")

Output:

Operation on pd df took 84.00791883468628 seconds.

Since element-wise operations on a DataFrame are quite slow, we can operate on the underlying numpy arrays instead.

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

e_vals = [d_vals[0]]
last_e = e_vals[0]
for i in range(len(df)-1):
    if last_e < 0:
        last_e = d_vals[i+1]
    else:
        last_e = c_vals[i+1]
    e_vals.append(last_e)

df['E'] = e_vals

print(f"Operation on np array took {time.time() - time_start} seconds.")

Output:

Operation on np array took 2.2387869358062744 seconds.

Now, for loops are slow in Python, so we can use a JIT compiler that can deal with numpy arrays, for instance numba.

import numba

time_start = time.time()
df = pd.DataFrame(data=data, columns=["A", "B", "C", "D"])
c_vals = df["C"].values
d_vals = df["D"].values

@numba.jit(nopython=True)
def numba_calc(c_vals, d_vals):
    e_vals = [d_vals[0]]
    last_e = e_vals[0]
    for i in range(len(c_vals)-1):
        if last_e < 0:
            last_e = d_vals[i+1]
        else:
            last_e = c_vals[i+1]
        e_vals.append(last_e)

    return e_vals

df["E"] = numba_calc(c_vals, d_vals)

print(f"Operation on np array with numba took {time.time() - time_start} seconds.")

Output:

Operation on np array with numba took 1.2623450756072998 seconds.

So numba pays off especially for larger DataFrames, while operating on the raw numpy arrays already gives a nice runtime improvement on its own.
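One caveat: the first call to a numba-jitted function includes one-off compilation time, so the speedup only materializes for large or repeated inputs. A minimal self-contained sketch (toy function, not the benchmark above) to see this:

import time
import numba
import numpy as np

@numba.njit
def cumsum_loop(x):
    # trivial sequential loop, only used to illustrate compilation cost
    out = np.empty_like(x)
    acc = 0.0
    for i in range(len(x)):
        acc += x[i]
        out[i] = acc
    return out

x = np.random.rand(1000)

t0 = time.time()
cumsum_loop(x)  # first call: triggers JIT compilation
print(f"first call:  {time.time() - t0:.4f}s")

t0 = time.time()
cumsum_loop(x)  # second call: reuses the compiled machine code
print(f"second call: {time.time() - t0:.6f}s")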

Upvotes: 0
