sousben

Reputation: 353

Most optimized way to perform calculations on rows of lists of various sizes

Given the following dataframe:

df = pd.DataFrame({'list_col': [np.random.randint(0,100,size=(1, np.random.randint(0,10)))[0] for i in range(100000)]})


What would be an optimal way to return the sum of each row? (empty rows = 0)

I have read that using .apply is usually discouraged in pandas:

df.list_col.apply(sum)


However, when trying to make proper use of vectorized calculations, I was only able to come up with the following:

np.nansum(pd.DataFrame(df.list_col.values.tolist()).values, axis=1)

which turned out to be slower.

So what would be a proper way to use numpy's vectorized calculations on an array of lists of varying sizes?

Upvotes: 2

Views: 63

Answers (2)

Kenan

Reputation: 14094

I think your approach is already fairly well optimized, but you can save a few milliseconds:

%timeit df['list_col'].map(sum)
162 ms ± 5.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['list_col'].apply(sum)
156 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['list_col'].map(np.sum)
306 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I prefer to use map for Series operations, since apply is typically used for DataFrames.
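As a quick sanity check (on a hypothetical toy Series, not the question's data), map and apply produce identical results here; the difference is mostly convention plus a small constant overhead:

```python
import pandas as pd

# A small Series of variable-length lists, including an empty row
s = pd.Series([[1, 2], [], [3, 4, 5]])

# Both element-wise routes give the same sums (empty row -> 0)
print(s.map(sum).tolist())    # [3, 0, 12]
print(s.apply(sum).tolist())  # [3, 0, 12]
```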

Upvotes: 0

BENY

Reputation: 323306

Considering speed, list with map is a little faster than the others:

%timeit df.list_col.apply(sum)
10 loops, best of 3: 130 ms per loop
%timeit np.nansum(pd.DataFrame(df.list_col.values.tolist()).values, axis=1)
1 loop, best of 3: 169 ms per loop
%timeit list(map(sum,df.list_col.tolist()))
10 loops, best of 3: 93.6 ms per loop
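For a fully vectorized route that avoids Python-level iteration over the rows entirely, one sketch (assuming the same kind of df as in the question; row_sums is a name introduced here for illustration) is to flatten all rows into one array and take differences of a prefix sum at the row boundaries. Empty rows come out as 0 automatically:

```python
import numpy as np
import pandas as pd

def row_sums(list_col):
    """Sum each row of a Series of variable-length arrays without a per-row Python loop."""
    lengths = np.fromiter((len(x) for x in list_col), dtype=np.int64,
                          count=len(list_col))
    flat = np.concatenate(list(list_col))          # all values in one flat array
    csum = np.concatenate(([0], np.cumsum(flat)))  # csum[i] == sum(flat[:i])
    ends = np.cumsum(lengths)                      # exclusive end index of each row
    starts = ends - lengths                        # start index of each row
    return csum[ends] - csum[starts]               # empty rows: starts == ends -> 0

df = pd.DataFrame({'list_col': [np.array([1, 2, 3]),
                                np.array([], dtype=int),
                                np.array([10, 5])]})
print(row_sums(df['list_col']))  # [ 6  0 15]
```

Only the length computation touches Python objects; the summation itself is a single cumsum plus fancy indexing, which sidesteps the per-slice edge cases of np.add.reduceat on empty rows.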

Upvotes: 1
