rahs

Reputation: 1899

statistics.mean() vs sum()/len() vs np.average() for a list of lists

Data: A list of equal-sized lists that have to be averaged along columns to return one averaged list

Is it faster to average the above-mentioned data in Python using either statistics.mean() or sum()/len(), or is it faster to convert it to a numpy array and then use np.average()?

Or is there no significant difference?

This question answers which method to use, but does not compare it with the alternatives.
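For concreteness, a minimal sketch of the shape involved (made-up values, not real data):

# Three equal-sized lists, to be averaged along columns
data = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
# desired result (one list of column averages): [4.0, 5.0, 6.0]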

Upvotes: 2

Views: 3007

Answers (2)

Dani Mesejo

Reputation: 61920

You can measure the performance of the different proposals. I am assuming that averaging along the columns means reducing across the rows: for instance, if you have 1000 lists of 100 elements each, at the end you have one list of 100 averages.

import random
import numpy as np
import statistics
import timeit

data = [[random.random() for _ in range(100)] for _ in range(1000)]


def average(data):
    # numpy: convert to an array and average over the rows (axis 0),
    # giving one value per column
    return np.average(data, axis=0)


def sum_len(data):
    # pure Python: transpose with zip and compute sum/len per column
    return [sum(l) / len(l) for l in zip(*data)]


def mean(data):
    # statistics.mean applied to each column after transposing
    return [statistics.mean(l) for l in zip(*data)]


if __name__ == "__main__":
    print(timeit.timeit('average(data)', 'from __main__ import data,average', number=10))
    print(timeit.timeit('sum_len(data)', 'from __main__ import data,sum_len', number=10))
    print(timeit.timeit('mean(data)', 'from __main__ import data,mean', number=10))

Output

0.025441123012569733
0.029354612997849472
1.0484535950090503

It appears that statistics.mean is considerably slower (about 35 times slower) than np.average and the sum_len approach, and that np.average is marginally faster than sum_len.
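As a quick sanity check (a sketch, not part of the original script), the three functions can be compared for equal results before trusting the timings:

# Sanity check (sketch): all three approaches should give the same column averages
assert np.allclose(average(data), sum_len(data))
assert np.allclose(average(data), mean(data))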

Upvotes: 5

mmagnuski

Reputation: 1275

This may depend on the number of 'rows' and 'columns' (that is, the number of lists and the number of elements in each list), but with as few as 10 lists of 10 elements each you can already see numpy's advantage:

import numpy as np
from statistics import mean

# construct the data
n_rows = 10
n_columns = 10
data = [np.random.random(n_columns).tolist() for x in range(n_rows)]

# define functions, I take your 'along columns' to mean that
# the columns dimension is reduced with mean:
def list_mean(data):
    return [mean(x) for x in data]

def numpy_mean(data):
    return np.asarray(data).mean(axis=1)

# time results with %timeit magic in notebook:
%timeit list_mean(data)
# 528 µs ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit numpy_mean(data)
# 19.7 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In that case the numpy mean is about 27 times faster than the list comprehension, but with larger data the numpy speedup will be greater (with 100 lists of 100 elements each, numpy is about 70 times faster).
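Since %timeit is an IPython/notebook magic, here is a minimal sketch of how the same comparison could be run as a plain script with the standard-library timeit (sizes chosen to match the 100-by-100 case mentioned above; exact numbers will vary by machine):

import timeit
import numpy as np
from statistics import mean

n_rows, n_columns = 100, 100
data = [np.random.random(n_columns).tolist() for _ in range(n_rows)]

def list_mean(data):
    # statistics.mean on each inner list
    return [mean(x) for x in data]

def numpy_mean(data):
    # convert once to an array and reduce the columns dimension
    return np.asarray(data).mean(axis=1)

# time 100 calls of each and print the totals
print(timeit.timeit(lambda: list_mean(data), number=100))
print(timeit.timeit(lambda: numpy_mean(data), number=100))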

Upvotes: 2
