Bussller

Reputation: 2011

Fastest way to count the number of occurrences of a character in a numpy.chararray

Pythonists,

What is the fastest way to count the occurrences of a character in a numpy.chararray?

I am doing the following:

In [59]: for i in range(10):
...:     m = input("Enter A or B: ")
...:     rr[0][i] = m
...:     
Enter A or B: B
Enter A or B: B
Enter A or B: B
Enter A or B: A
Enter A or B: B
Enter A or B: A
Enter A or B: A
Enter A or B: A
Enter A or B: B
Enter A or B: A

In [60]: rr
Out[60]: 
chararray([['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A']],
          dtype='<U1')

In [61]: %timeit a = rr.count('A')
12.5 µs ± 206 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [62]: %timeit d = len(a[a.nonzero()])
3.03 µs ± 54.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I believe there must be a better way to achieve this with speed and elegance.

Upvotes: 5

Views: 2200

Answers (2)

Divakar

Reputation: 221594

It's better to stick to regular NumPy arrays rather than chararrays:

Note:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.
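As a rough illustration of the free functions the note mentions, the sketch below applies np.char.count to a plain unicode array. np.char.count returns a count per element, so the total comes from summing the result. The variable names are illustrative and the data is copied from the question.

```python
import numpy as np

# Sample data from the question, stored as a regular unicode array.
rr = np.array([['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A']])

# np.char.count gives, for each element, how many times 'A' occurs in it.
per_element = np.char.count(rr, 'A')

# Summing the per-element counts gives the total number of occurrences.
total_a = int(per_element.sum())
print(total_a)
```

This is mainly useful when elements can hold more than one character; for single-character elements a plain comparison is usually simpler.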

Going with the regular arrays, let's propose two approaches.

Approach #1

We could use np.count_nonzero to count the True values after comparing against the search element 'A' -

np.count_nonzero(rr=='A')
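A self-contained sketch of this approach, assuming rr is rebuilt as a plain unicode array with the question's data (the names mask and count_a are only for illustration):

```python
import numpy as np

# Sample data from the question, as a regular array instead of a chararray.
rr = np.array([['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A']])

# Elementwise comparison yields a boolean array, True where the element is 'A'.
mask = rr == 'A'

# count_nonzero counts the True entries without building an intermediate sum.
count_a = np.count_nonzero(mask)
print(count_a)
```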

Approach #2

With the chararray holding single-character elements only, we could do much better by viewing it with a uint8 dtype and then comparing and counting. The counting is much faster because we are working with numeric data. The implementation would be -

np.count_nonzero(rr.view(np.uint8)==ord('A'))

On Python 2.x, it would be -

np.count_nonzero(np.array(rr.view(np.uint8))==ord('A'))
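The view-based count can be sketched as below, assuming a contiguous '<U1' array. One detail worth keeping in mind: each '<U1' element occupies 4 bytes, so the uint8 view has four times as many entries, and the byte comparison is only safe as a character count when the data is ASCII-only. The uint32 variant is an addition for illustration (not part of the original answer); it compares whole code points instead of individual bytes.

```python
import numpy as np

# Sample data from the question, as a regular unicode array (4 bytes per element).
rr = np.array([['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A']])

# Byte view: 4 bytes per character, so the last axis grows by a factor of 4.
as_bytes = rr.view(np.uint8)
count_u8 = np.count_nonzero(as_bytes == ord('A'))

# Code-point view (illustrative variant): one uint32 per character.
as_codes = rr.view(np.uint32)
count_u32 = np.count_nonzero(as_codes == ord('A'))

print(as_bytes.shape, as_codes.shape)
print(count_u8, count_u32)
```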

Timings

Timings on the original sample data and on a version scaled up 10,000x -

# Original sample data
In [10]: rr
Out[10]: array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'], dtype='<U1')

# @Nils Werner's soln
In [14]: %timeit np.sum(rr == 'A')
100000 loops, best of 3: 3.86 µs per loop

# Approach #1 from this post
In [13]: %timeit np.count_nonzero(rr=='A')
1000000 loops, best of 3: 1.04 µs per loop

# Approach #2 from this post
In [40]: %timeit np.count_nonzero(rr.view(np.uint8)==ord('A'))
1000000 loops, best of 3: 1.86 µs per loop

# Original sample data scaled by 10,000x
In [16]: rr = np.repeat(rr,10000)

# @Nils Werner's soln
In [18]: %timeit np.sum(rr == 'A')
1000 loops, best of 3: 734 µs per loop

# Approach #1 from this post
In [17]: %timeit np.count_nonzero(rr=='A')
1000 loops, best of 3: 659 µs per loop

# Approach #2 from this post
In [24]: %timeit np.count_nonzero(rr.view(np.uint8)==ord('A'))
10000 loops, best of 3: 40.2 µs per loop
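For readers without IPython, a comparison along the same lines can be sketched with the standard timeit module. The candidates dictionary and the loop counts are arbitrary choices for illustration, and absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

# The question's data, scaled 10,000x as in the timings above.
base = np.array(['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A'])
rr = np.repeat(base, 10000)

# The three approaches being compared (labels are illustrative).
candidates = {
    "np.sum": lambda: np.sum(rr == 'A'),
    "count_nonzero": lambda: np.count_nonzero(rr == 'A'),
    "uint8 view": lambda: np.count_nonzero(rr.view(np.uint8) == ord('A')),
}

# Check that every approach agrees before timing it.
results = {name: int(fn()) for name, fn in candidates.items()}

# Best of 3 repeats, 100 calls each, reported as seconds per call.
loops = 100
timings = {
    name: min(timeit.repeat(fn, number=loops, repeat=3)) / loops
    for name, fn in candidates.items()
}

for name in candidates:
    print(f"{name}: {timings[name] * 1e6:.1f} µs per call, count = {results[name]}")
```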

Upvotes: 4

Nils Werner

Reputation: 36765

chararray is deprecated; use array(..., dtype='<U1') instead. That being said, you can do

import numpy as np

r = np.array([['B', 'B', 'B', 'A', 'B', 'A', 'A', 'A', 'B', 'A']])

%timeit np.sum(r == 'A')
# 4.82 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Upvotes: 1
