Reputation: 2011
Pythonists,
What is the fastest way to count the occurrences of a character in a numpy.chararray?
I am doing the following:
In [59]: for i in range(10):
...: m = input("Enter A or B: ")
...: rr[0][i] = m
...:
Enter A or B: B
Enter A or B: B
Enter A or B: B
Enter A or B: A
Enter A or B: B
Enter A or B: A
Enter A or B: A
Enter A or B: A
Enter A or B: B
Enter A or B: A
In [60]: rr
Out[60]:
chararray([['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A']],
dtype='<U1')
In [61]: %timeit a = rr.count('A')
12.5 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [62]: %timeit d = len(a[a.nonzero()])
3.03 µs ± 54.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I believe there must be a better way to achieve this with speed and elegance.
Upvotes: 5
Views: 2200
Reputation: 221594
It's better to stick to regular NumPy arrays rather than chararrays:
Note:
The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.
Going with the regular arrays, let's propose two approaches.
Approach #1
We could use np.count_nonzero to count the True values after comparing against the search element 'A':
np.count_nonzero(rr=='A')
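As a minimal, self-contained sketch of Approach #1 (assuming rr is rebuilt as a regular 1-D array holding the sample data from the question; the names mask and count_a are only illustrative):

```python
import numpy as np

# Sample data from the question, as a regular array (dtype '<U1')
rr = np.array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'])

# Elementwise comparison gives a boolean mask, True where the element is 'A'
mask = rr == 'A'

# count_nonzero counts the True entries of the mask
count_a = np.count_nonzero(mask)
print(count_a)  # 5
```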
Approach #2
With the chararray holding single-character elements only, we could optimize further by viewing it as uint8 dtype and then comparing and counting. The counting would be much faster, as we would be working with numeric data. The implementation would be:
np.count_nonzero(rr.view(np.uint8)==ord('A'))
On Python 2.x, it would be:
np.count_nonzero(np.array(rr.view(np.uint8))==ord('A'))
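One caveat may be worth spelling out. A '<U1' element occupies 4 bytes, so the uint8 view is four times as long as the array, and the byte comparison is only safe when every character is below code point 256 (true for plain 'A'/'B' data). The sketch below illustrates this, with a uint32 view as a possible alternative that compares whole code points; the variable names and the 'Ł' test character are illustrative choices, not part of the original answer:

```python
import numpy as np

rr = np.array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'])  # dtype '<U1'

# uint8 view: 4 bytes per character, so 40 entries for 10 characters
as_bytes = rr.view(np.uint8)
count_uint8 = np.count_nonzero(as_bytes == ord('A'))

# uint32 view: one code point per character, so 10 entries
as_codepoints = rr.view(np.uint32)
count_uint32 = np.count_nonzero(as_codepoints == ord('A'))

# U+0141 ('Ł') contains the byte 0x41, which equals ord('A'),
# so the byte-level view over-counts on such data.
tricky = np.array(['\u0141', 'A'])
tricky_uint8 = np.count_nonzero(tricky.view(np.uint8) == ord('A'))
tricky_uint32 = np.count_nonzero(tricky.view(np.uint32) == ord('A'))

print(count_uint8, count_uint32)    # 5 5
print(tricky_uint8, tricky_uint32)  # 2 1
```

For the question's data both views give the same count, so the uint8 trick is fine there; the uint32 variant is only a suggestion for input that may contain non-Latin-1 characters.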
Timings
Timings on the original sample data and on a version scaled up by 10,000x:
# Original sample data
In [10]: rr
Out[10]: array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'], dtype='<U1')
# @Nils Werner's soln
In [14]: %timeit np.sum(rr == 'A')
100000 loops, best of 3: 3.86 µs per loop
# Approach #1 from this post
In [13]: %timeit np.count_nonzero(rr=='A')
1000000 loops, best of 3: 1.04 µs per loop
# Approach #2 from this post
In [40]: %timeit np.count_nonzero(rr.view(np.uint8)==ord('A'))
1000000 loops, best of 3: 1.86 µs per loop
# Original sample data scaled by 10,000x
In [16]: rr = np.repeat(rr,10000)
# @Nils Werner's soln
In [18]: %timeit np.sum(rr == 'A')
1000 loops, best of 3: 734 µs per loop
# Approach #1 from this post
In [17]: %timeit np.count_nonzero(rr=='A')
1000 loops, best of 3: 659 µs per loop
# Approach #2 from this post
In [24]: %timeit np.count_nonzero(rr.view(np.uint8)==ord('A'))
10000 loops, best of 3: 40.2 µs per loop
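To reproduce this comparison outside IPython, something along these lines should work with the standard timeit module. It is only a sketch: the names base, rr_big and candidates are illustrative, and absolute numbers will differ by machine and NumPy version, so no expected timings are shown.

```python
import timeit
import numpy as np

base = np.array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'])
rr_big = np.repeat(base, 10000)  # sample data scaled by 10,000x

# The three methods compared in the timings above
candidates = {
    "np.sum": lambda: np.sum(rr_big == 'A'),
    "count_nonzero": lambda: np.count_nonzero(rr_big == 'A'),
    "uint8 view": lambda: np.count_nonzero(rr_big.view(np.uint8) == ord('A')),
}

# Check that all methods agree before timing them
counts = {name: int(fn()) for name, fn in candidates.items()}

for name, fn in candidates.items():
    # Best of 3 repeats, 100 calls each, reported per call
    best = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print(f"{name}: {best * 1e6:.1f} us per call")
```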
Upvotes: 4
Reputation: 36765
chararray is deprecated; use array(..., dtype='<U1') instead. That being said, you can do:
r = np.array([['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A']])
%timeit np.sum(r == 'A')
# 4.82 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Upvotes: 1