kur ag

Reputation: 593

Numpy unique changes integer to string

I have a data table with string and integer columns, such as:

test_data = [('A',1,2,3),('B',4,5,6),('A',1,2,3)]

I need the unique rows, so I used NumPy's unique function:

summary, repeat = np.unique(test_data, return_counts=True, axis=0)

But afterwards my data types have changed. The summary is:

array([['A', '1', '2', '3'],
       ['B', '4', '5', '6']], dtype='<U1')

All values are now strings. How can I prevent this change? (Python 3.7, numpy 1.16.4)

Upvotes: 2

Views: 4375

Answers (3)

Mad Physicist

Reputation: 114230

If you have Python objects and you want to retain them as Python objects, use Python functions:

unique_rows = set(test_data)

Or better yet:

from collections import Counter

rows_and_counts = Counter(test_data)

These solutions do not copy the data: they retain references to the original tuples just as they are. The numpy solution copies the data multiple times: once when converting to numpy, at least once when sorting, and possibly more when converting back.

These solutions have O(N) algorithmic complexity because they both use a hash table. The numpy unique solution uses sorting, and is therefore of O(N log N) complexity.
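
As a minimal sketch of the Counter approach applied to the data from the question: the names summary and repeat are borrowed from the question only for comparison with the np.unique output; they are not part of Counter itself.

```python
from collections import Counter

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

# Tuples are hashable, so each whole row can serve as a Counter key.
rows_and_counts = Counter(test_data)

# Counter is a dict subclass, so on Python 3.7+ the keys keep
# first-seen insertion order.
summary = list(rows_and_counts)            # unique rows, original types intact
repeat = list(rows_and_counts.values())    # how often each row occurred

print(summary)  # [('A', 1, 2, 3), ('B', 4, 5, 6)]
print(repeat)   # [2, 1]
```

Note that np.unique returns the rows sorted, while this keeps them in first-seen order; use sorted(rows_and_counts.items()) if the order matters.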

Upvotes: 4

wprazuch

Reputation: 101

You could explicitly specify the dtype in the np.array call that precedes np.unique:

test_data = [('A',1,2,3),('B',4,5,6),('A',1,2,3)]

# np.int was only an alias for the builtin int and was removed in NumPy 1.24
test_data = np.array(test_data, dtype=[('letter', '<U1'),
                                       ('x', int),
                                       ('y', int),
                                       ('z', int)])
                                 
summary, repeat = np.unique(test_data,return_counts=True, axis=0)

The summary then looks as follows:

array([('A', 1, 2, 3), ('B', 4, 5, 6)],
      dtype=[('letter', '<U1'), ('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
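
If plain Python tuples are needed afterwards, the structured result can be converted back with tolist(). The following is a sketch rather than a definitive recipe: the names row_dtype, records, rows and counts are chosen here for illustration, and axis is left out because the structured array is one-dimensional, so each record already counts as a single element.

```python
import numpy as np

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

# One field per column. '<U1' holds a single character only, so longer
# labels would need a wider string field (for example '<U10').
row_dtype = [('letter', '<U1'), ('x', int), ('y', int), ('z', int)]
records = np.array(test_data, dtype=row_dtype)   # shape (3,), one record per row

summary, repeat = np.unique(records, return_counts=True)

# tolist() turns each record into a tuple of native Python values.
rows = summary.tolist()
counts = repeat.tolist()

print(rows)    # [('A', 1, 2, 3), ('B', 4, 5, 6)]
print(counts)  # [2, 1]
```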

Upvotes: 3

Aratz

Reputation: 440

I think this happens because all items in a numpy array must have the same type, so the integers are converted to strings. What you could do instead is parse the result back when it comes out of numpy, e.g.:

result = []
for l in summary.tolist():
    new_l = []
    for v in l:
        try:
            new_l.append(int(v))
        except ValueError:
            new_l.append(v)
    result.append(tuple(new_l))
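
For reference, here is a self-contained sketch of the same idea, with the loop wrapped in a helper named parse_back (a hypothetical name used only for this example). It also shows one limitation to keep in mind: a value that was originally a digit-only string cannot be told apart from an integer once everything has been converted to strings.

```python
import numpy as np

def parse_back(rows):
    """Convert values that look like integers back to int; keep the rest."""
    result = []
    for row in rows:
        new_row = []
        for value in row:
            try:
                new_row.append(int(value))
            except ValueError:
                new_row.append(value)
        result.append(tuple(new_row))
    return result

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

# Mixed str/int input is coerced to a single Unicode string dtype.
arr = np.array(test_data)
print(arr.dtype.kind)  # 'U'

summary, repeat = np.unique(arr, return_counts=True, axis=0)
restored = parse_back(summary.tolist())
print(restored)  # [('A', 1, 2, 3), ('B', 4, 5, 6)]

# Limitation: a label that was the string '7' comes back as the int 7.
ambiguous = parse_back([('7', '1', '2', '3')])
print(ambiguous)  # [(7, 1, 2, 3)]
```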

Upvotes: 2
