Reputation: 1899
Data: a list of equal-sized lists that has to be averaged along columns to return a single averaged list.
Is it faster to average this data in Python using statistics.mean() or sum()/len(), or is it faster to convert it to a NumPy array and then use np.average()?
Or is there no significant difference?
This question provides an answer on which method to use, but does not compare it with any alternatives.
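For concreteness, here is a minimal sketch of what I mean by "averaged along columns" (toy values, not my real data):
import statistics

data = [[1.0, 2.0, 3.0],
        [3.0, 4.0, 5.0]]

# Averaging along columns collapses the rows into one list of column means.
column_means = [statistics.mean(column) for column in zip(*data)]
print(column_means)  # [2.0, 3.0, 4.0]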
Upvotes: 2
Views: 3007
Reputation: 61920
You can measure the performance of the different proposals. I am assuming that averaging "along the columns" means reducing across rows: for instance, if you have 1000 lists of 100 elements each, you end up with one list of 100 averages.
import random
import statistics
import timeit

import numpy as np

data = [[random.random() for _ in range(100)] for _ in range(1000)]

def average(data):
    # np.average converts the nested list to an array and reduces along axis 0
    return np.average(data, axis=0)

def sum_len(data):
    # zip(*data) transposes, so each l is one column
    return [sum(l) / len(l) for l in zip(*data)]

def mean(data):
    return [statistics.mean(l) for l in zip(*data)]

if __name__ == "__main__":
    print(timeit.timeit('average(data)', 'from __main__ import data,average', number=10))
    print(timeit.timeit('sum_len(data)', 'from __main__ import data,sum_len', number=10))
    print(timeit.timeit('mean(data)', 'from __main__ import data,mean', number=10))
Output
0.025441123012569733
0.029354612997849472
1.0484535950090503
It appears that statistics.mean is considerably slower (about 35 times slower) than both np.average and the sum_len method, and that np.average is marginally faster than sum_len.
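Note that the average timing above includes converting the nested list to an ndarray on every call. If your data already lives in an array, the reduction alone should be cheaper still; a sketch to verify that on your machine (the arr variable is my own addition):
import random
import timeit

import numpy as np

data = [[random.random() for _ in range(100)] for _ in range(1000)]
arr = np.asarray(data)  # pay the list-to-array conversion once, up front

if __name__ == "__main__":
    # Same reduction as average(data) above, but without per-call conversion
    print(timeit.timeit('np.average(arr, axis=0)', 'from __main__ import arr, np', number=10))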
Upvotes: 5
Reputation: 1275
This may depend on the number of elements in the 'rows' and 'columns' (that is, the number of lists and the number of elements in each list), but with as few as 10 lists of 10 elements each you can already see numpy's advantage:
import numpy as np
from statistics import mean

# construct the data
n_rows = 10
n_columns = 10
data = [np.random.random(n_columns).tolist() for x in range(n_rows)]

# define functions; I take your 'along columns' to mean that
# the columns dimension is reduced with mean:
def list_mean(data):
    return [mean(x) for x in data]

def numpy_mean(data):
    return np.asarray(data).mean(axis=1)

# time results with the %timeit magic in a notebook:
%timeit list_mean(data)
# 528 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit numpy_mean(data)
# 19.7 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In that case numpy's mean is about 27 times faster than the list comprehension, but with larger data the numpy speedup grows (with 100 lists of 100 elements each, numpy is about 70 times faster).
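If you want to see how the gap scales on your own machine, here is a quick sketch (my own; it uses timeit instead of the notebook magic so it runs as a plain script):
import timeit

import numpy as np
from statistics import mean

def list_mean(data):
    return [mean(x) for x in data]

def numpy_mean(data):
    return np.asarray(data).mean(axis=1)

# Time both implementations at a few sizes; larger sizes should
# amortize numpy's conversion overhead further.
for n in (10, 100, 1000):
    data = [np.random.random(n).tolist() for _ in range(n)]
    t_list = timeit.timeit(lambda: list_mean(data), number=10)
    t_numpy = timeit.timeit(lambda: numpy_mean(data), number=10)
    print(f"n={n}: list {t_list:.4f}s, numpy {t_numpy:.4f}s, ratio {t_list / t_numpy:.0f}x")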
Upvotes: 2