Reputation: 9861
I have two dataframes `s` and `sk` with around 1M rows each, and I need to generate a new dataframe `df` from them where:
df.iloc[i] = s.iloc[f(i)] / sk.iloc[g(i)]
where `f` and `g` are functions that return integers.
Currently I'm doing:
data = []
for i in range(s.shape[0]):
    data.append(s.iloc[f(i)] / sk.iloc[g(i)])
df = pd.DataFrame(data, columns=s.columns)
But this seems slow: it takes about 5 minutes (the dataframes have 9 float columns). That is only ~10M divisions, so 5 minutes seems sub-par. All the time appears to be spent indexing into `s` and `sk`, so I was wondering if there is a way to build `s[f]` and `sk[g]` quickly?
Edit: `f` and `g` are simple functions similar to:
def f(i): return math.ceil(i / 23)
def g(i): return math.ceil(i / 23) + ((i - 1) % 23)
Upvotes: 1
Views: 68
Reputation: 51165
Your functions are easily vectorized.
def f_vec(i):
    return np.ceil(i / 23).astype(int)

def g_vec(i):
    return (np.ceil(i / 23) + ((i - 1) % 23)).astype(int)
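A quick sanity check (a sketch, assuming the scalar `f` and `g` from the question) confirms the vectorized versions agree with the originals over a range of indices, including `i = 0`, where Python's and NumPy's modulo of a negative number both yield `22`:

```python
import math

import numpy as np

def f(i):
    return math.ceil(i / 23)

def g(i):
    return math.ceil(i / 23) + ((i - 1) % 23)

def f_vec(i):
    return np.ceil(i / 23).astype(int)

def g_vec(i):
    return (np.ceil(i / 23) + ((i - 1) % 23)).astype(int)

idx = np.arange(100)
# Vectorized results match the scalar loop element for element.
assert (f_vec(idx) == np.array([f(i) for i in idx])).all()
assert (g_vec(idx) == np.array([g(i) for i in idx])).all()
```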
As @Wen points out, we can further optimize this by writing a wrapper to only calculate the ceiling once.
def wrapper(i, a, b):
    # Compute the ceiling term once and reuse it for both index arrays.
    cache_ceil = np.ceil(i / 23).astype(int)
    fidx = cache_ceil
    gidx = cache_ceil + ((i - 1) % 23)
    return a.iloc[fidx].to_numpy() / b.iloc[gidx].to_numpy()
Index alignment is also not working in your favor here. If you truly want elementwise division of the two results, drop down to `numpy` before dividing:
s.iloc[f_vec(idx)].to_numpy() / sk.iloc[g_vec(idx)].to_numpy()
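To see why the `.to_numpy()` calls matter, here is a minimal sketch (the toy frames `a` and `b` are illustrative, not from the question): dividing two DataFrames aligns rows by index label, not by position, so non-overlapping labels produce NaN instead of the positional quotient you want.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'x': [10.0, 20.0]}, index=[0, 1])
b = pd.DataFrame({'x': [2.0, 5.0]}, index=[1, 2])

# Label-based alignment: only label 1 overlaps; labels 0 and 2 become NaN.
aligned = a / b

# Positional division via numpy: row i of `a` divided by row i of `b`.
positional = a.to_numpy() / b.to_numpy()
```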
Now to test out the speed.
Setup
a = np.random.randint(1, 10, (1_000_000, 10))
s = pd.DataFrame(a)
sk = pd.DataFrame(a)
idx = np.arange(1_000_000)
Performance
%timeit s.iloc[f_vec(idx)].to_numpy() / sk.iloc[g_vec(idx)].to_numpy()
265 ms ± 5.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit wrapper(idx, s, sk)
200 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 4