Reputation: 353
Given the following dataframe:
df = pd.DataFrame({'list_col': [np.random.randint(0,100,size=(1, np.random.randint(0,10)))[0] for i in range(100000)]})
What would be an optimal way to return the sum of each row? (empty rows = 0)
I read that using .apply is usually discouraged in pandas
df.list_col.apply(sum)
However, when trying to make proper use of vectorized calculations, I was only able to come up with the following:
np.nansum(pd.DataFrame(df.list_col.values.tolist()).values, axis=1)
which turned out to be slower.
So what would be a proper way to use numpy's vectorized calculations on an array of lists of varying sizes?
Upvotes: 2
Views: 63
Reputation: 14094
I think your approach is already fairly well optimized, but you can shave off a few milliseconds by using map:
%timeit df['list_col'].map(sum)
162 ms ± 5.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['list_col'].apply(sum)
156 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['list_col'].map(np.sum)
306 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I prefer to use map for a Series operation, since apply is more commonly used on DataFrames.
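As a quick sanity check, map(sum) and apply(sum) produce identical results on this column; a minimal sketch using a smaller, seeded version of the question's setup (the seed and row count are my own choices for reproducibility):

```python
import numpy as np
import pandas as pd

# Seeded, scaled-down version of the question's dataframe (assumption: 5 rows is enough to illustrate)
rng = np.random.default_rng(0)
df = pd.DataFrame({'list_col': [rng.integers(0, 100, size=rng.integers(0, 10))
                                for _ in range(5)]})

# map and apply are interchangeable here: both call sum(arr) per element,
# and the builtin sum returns 0 for an empty array, so empty rows give 0
assert df['list_col'].map(sum).equals(df['list_col'].apply(sum))
print(df['list_col'].map(sum))
```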
Upvotes: 0
Reputation: 323306
Comparing speeds, list with map is a little faster than the others:
%timeit df.list_col.apply(sum)
10 loops, best of 3: 130 ms per loop
%timeit np.nansum(pd.DataFrame(df.list_col.values.tolist()).values, axis=1)
1 loop, best of 3: 169 ms per loop
%timeit list(map(sum,df.list_col.tolist()))
10 loops, best of 3: 93.6 ms per loop
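If you want a genuinely vectorized path, one option is to flatten the ragged lists once and take differences of a running total at each row boundary; empty rows then get 0 by construction. This is a sketch, not a benchmarked recommendation, and the names (`flat`, `lengths`, `ends`, `csum`) are my own:

```python
import numpy as np
import pandas as pd

# Seeded, scaled-down version of the question's setup (assumption: 1000 rows)
rng = np.random.default_rng(42)
df = pd.DataFrame({'list_col': [rng.integers(0, 100, size=rng.integers(0, 10))
                                for _ in range(1000)]})

# Flatten all rows into one 1-D array, and record each row's length
flat = np.concatenate(df['list_col'].to_list())
lengths = np.fromiter((len(a) for a in df['list_col']), dtype=np.intp)

# Row i spans flat[ends[i] - lengths[i] : ends[i]]; its sum is the
# difference of the prefix-sum array at those two boundaries.
ends = lengths.cumsum()
csum = np.concatenate(([0], flat.cumsum()))
sums = csum[ends] - csum[ends - lengths]

# Matches the per-row Python sum, including empty rows (start == end -> 0)
assert np.array_equal(sums, df['list_col'].map(sum).to_numpy())
```

Note that np.add.reduceat looks like the obvious tool here, but it mishandles empty segments (when two boundary indices coincide it returns the element at that index rather than 0), which is why this sketch uses the prefix-sum difference instead.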
Upvotes: 1