sousben

Reputation: 353

Most optimized way to perform calculations on rows of lists of various sizes

Given the following dataframe:

df = pd.DataFrame({'list_col': [np.random.randint(0,100,size=(1, np.random.randint(0,10)))[0] for i in range(100000)]})


What would be an optimal way to return the sum of each row? (empty rows = 0)

I have read that using .apply is usually discouraged in pandas:

df.list_col.apply(sum)


However, when trying to make proper use of vectorized calculations, I was only able to come up with the following:

np.nansum(pd.DataFrame(df.list_col.values.tolist()).values, axis=1)

which turned out to be slower.

So what would be a proper way to use numpy's vectorized calculations on an array of lists of varying sizes?

Upvotes: 2

Views: 63

Answers (2)

Kenan

Reputation: 14094

I think your approach is already fairly well optimized, but you can save a few milliseconds:

%timeit df['list_col'].map(sum)
162 ms ± 5.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['list_col'].apply(sum)
156 ms ± 747 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['list_col'].map(np.sum)
306 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I prefer to use map for Series operations, since apply is typically used for DataFrames.
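As a quick sanity check (on a hypothetical toy Series, not the question's data), map and apply produce identical results here; the difference is mostly convention plus a small constant overhead:

```python
import pandas as pd

# A small Series of variable-length lists, including an empty row
s = pd.Series([[1, 2], [], [3, 4, 5]])

# Both element-wise routes give the same sums (empty row -> 0)
print(s.map(sum).tolist())    # [3, 0, 12]
print(s.apply(sum).tolist())  # [3, 0, 12]
```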

Upvotes: 0

BENY

Reputation: 323306

Considering speed, list with map is a little faster than the others:

%timeit df.list_col.apply(sum)
10 loops, best of 3: 130 ms per loop
%timeit np.nansum(pd.DataFrame(df.list_col.values.tolist()).values, axis=1)
1 loop, best of 3: 169 ms per loop
%timeit list(map(sum,df.list_col.tolist()))
10 loops, best of 3: 93.6 ms per loop
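For a fully vectorized route that avoids Python-level iteration over the rows entirely, one sketch (assuming the same kind of df as in the question; row_sums is a name introduced here for illustration) is to flatten all rows into one array and take differences of a prefix sum at the row boundaries. Empty rows come out as 0 automatically:

```python
import numpy as np
import pandas as pd

def row_sums(list_col):
    """Sum each row of a Series of variable-length arrays without a per-row Python loop."""
    lengths = np.fromiter((len(x) for x in list_col), dtype=np.int64,
                          count=len(list_col))
    flat = np.concatenate(list(list_col))          # all values in one flat array
    csum = np.concatenate(([0], np.cumsum(flat)))  # csum[i] == sum(flat[:i])
    ends = np.cumsum(lengths)                      # exclusive end index of each row
    starts = ends - lengths                        # start index of each row
    return csum[ends] - csum[starts]               # empty rows: starts == ends -> 0

df = pd.DataFrame({'list_col': [np.array([1, 2, 3]),
                                np.array([], dtype=int),
                                np.array([10, 5])]})
print(row_sums(df['list_col']))  # [ 6  0 15]
```

Only the length computation touches Python objects; the summation itself is a single cumsum plus fancy indexing, which sidesteps the per-slice edge cases of np.add.reduceat on empty rows.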

Upvotes: 1
