Reputation: 47
I want to calculate pi (the mixing proportion): the number of labels belonging to a specific class divided by the total number of labels, for a GaussianMixture model.
tr_y is a pandas DataFrame:
index | labels
---|---
0 | 6
1 | 5
2 | 6
3 | 5
4 | 6

1000 rows × 1 column.
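For reference, a minimal setup that reproduces a frame of this shape (the exact label distribution here is an assumption):

import numpy as np
import pandas as pd

# 1000 rows with labels drawn from {5, 6}; the real distribution may differ.
rng = np.random.default_rng(0)
tr_y = pd.DataFrame({'labels': rng.choice([5, 6], size=1000)})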
Then I try to compare two approaches, timing the conversion and the counting separately:
%%timeit
y_list = tr_y.values.flatten().tolist()
>>> 12.3 µs ± 193 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
sum([1 if y == 5 else 0 for y in y_list]) / len(y_list)
>>> 54.9 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
arr = tr_y.to_numpy()
>>> 4.55 µs ± 92.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
sum([1 for i in arr if i == 5 ])/arr.__len__()
>>> 883 µs ± 48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Updated approach: converting the NumPy array to a list with tolist() first is much faster than the two previous approaches.
arr = tr_y.to_numpy().tolist()
%%timeit
sum([1 for i in arr if i == 5 ])/arr.__len__()
>>> 43.1 µs ± 410 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
source of the last approach
So using lists is faster than using a NumPy array.
I have searched for this, and I found that NumPy has to wrap each returned element in a Python type (e.g., numpy.float64 or numpy.int64 in this case), which takes time if you're iterating item by item. Further proof of this shows up during iteration: we alternate between two separate ids while iterating over the array, which means Python's memory allocator and garbage collector are working overtime to create new objects and then free them.
A list doesn't have this memory-allocator/garbage-collector overhead. The objects in the list already exist as Python objects (and will still exist after iteration), so neither plays any role in iterating over a list.
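A minimal sketch of that behaviour (the exact ids depend on the allocator, so the output will differ per run):

import numpy as np

arr = np.array([6, 5, 6, 5, 6])
py_list = arr.tolist()

# Iterating over the array wraps each element in a fresh numpy.int64 object;
# CPython typically reuses a couple of freed slots, so the ids alternate.
print([id(x) for x in arr])

# Iterating over the list yields references to objects that already exist
# (small ints are even interned), so no new objects are created.
print([id(x) for x in py_list])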
My search concludes that if we need to work on multidimensional matrices or do some vectorization, we should use NumPy arrays because they are faster and use less memory. Is that true?
Another thing I want to calculate is the memory consumption of both the NumPy array and the list. However, I find that sys.getsizeof is not reliable: it only gives the size of the pointer array and the container's header, and there is more to account for. Is there any reliable method to calculate memory consumption?
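For memory, a rough sketch of approaches that go beyond a bare sys.getsizeof (exact numbers vary with platform and library versions):

import sys
import numpy as np
import pandas as pd

tr_y = pd.DataFrame({'labels': [6, 5, 6, 5, 6] * 200})  # 1000 rows
arr = tr_y['labels'].to_numpy()
py_list = arr.tolist()

# NumPy: nbytes is the raw data buffer; getsizeof adds the array header.
print(arr.nbytes, sys.getsizeof(arr))

# List: getsizeof counts only the pointer array, so add the elements too
# (small ints are interned, so this over-counts shared objects).
print(sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list))

# Pandas: deep=True includes the underlying data, not just the container.
print(tr_y.memory_usage(deep=True))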
Another observation: when I convert the NumPy array to a list, I convert it to a row matrix, which is loaded into the L1 cache contiguously, rather than a column vector, which causes a lot of misses in the L1 cache. source
So what if we use a vector in Fortran order?
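A sketch of how one could test the memory-order question; for a single column the C and Fortran layouts coincide, so a 2-D array is used here to make any difference visible (the sizes and the axis choice are assumptions):

import numpy as np
from timeit import timeit

data = np.random.randint(5, 7, size=(1000, 1000))
c_arr = np.ascontiguousarray(data)  # row-major (C order)
f_arr = np.asfortranarray(data)     # column-major (Fortran order)

# Summing along rows reads contiguous memory in the C-ordered copy
# and strided memory in the Fortran-ordered one, and vice versa.
print(timeit(lambda: c_arr.sum(axis=1), number=100))
print(timeit(lambda: f_arr.sum(axis=1), number=100))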
Upvotes: 3
Views: 837
Reputation: 444
Use np.sum() like this:
np.sum(tr_y.labels.to_numpy()==5)/len(tr_y)
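If you want pi for every class at once (one mixing proportion per label), the same idea extends naturally; np.unique here is just one way to shape the result:

import numpy as np

# Proportion of each label in a single pass: counts / total.
labels, counts = np.unique(tr_y.labels.to_numpy(), return_counts=True)
pi = counts / len(tr_y)
print(dict(zip(labels, pi)))  # e.g. {5: 0.4, 6: 0.6} for the pattern shown above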
Now let's do some experiments. We will use the following setup:
import pandas as pd
import numpy as np
tr_y = pd.DataFrame({'labels': [6, 5, 6, 5, 6]*200000})
We use a larger dataset, 1,000,000 rows, to see whether the methods scale to bigger inputs. We will try a few different methods and see how they perform.
The worst performer is:
sum(tr_y.labels.to_numpy()==5)/len(tr_y)
1.91 s ± 42.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The next option is on average 14 times faster:
y_list = tr_y.to_numpy().tolist()
sum([1 if y == 5 else 0 for y in y_list]) / len(y_list)
132 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
After that, we get a further 1.6× speedup with:
sum(tr_y.labels==5)/len(tr_y)
79.3 ms ± 796 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
None of these methods, however, is optimised with NumPy. They use NumPy arrays but are bogged down by Python's sum(). If we use the optimised NumPy version we get:
np.sum(tr_y.labels.to_numpy()==5)/len(tr_y)
1.36 ms ± 6.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This operation was on average 58 times faster than our previous best. This is more like the power of NumPy that we were promised. By using np.sum() instead of Python's standard sum(), we are able to do the same operation about 1,400 times faster (1.9 s vs 1.4 ms).
Since Pandas Series are built on NumPy arrays, the following code gives very similar performance to our optimal setup:
np.sum(tr_y.labels==5)/len(tr_y)
1.83 ms ± 39.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Unless optimizing your code is essential, I would personally go for this option as it is the clearest to read without losing much performance.
Upvotes: 2