Reputation: 593
I have a data table with string and integer columns, such as:
test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]
I need the unique rows, so I used numpy's unique function:
summary, repeat = np.unique(test_data, return_counts=True, axis=0)
But afterwards my data types are changed. summary is:
array([['A', '1', '2', '3'],
       ['B', '4', '5', '6']], dtype='<U1')
All the values are now strings. How can I prevent this change? (Python 3.7, numpy 1.16.4)
Upvotes: 2
Views: 4375
Reputation: 114230
If you have Python objects and you want to keep them as Python objects, use Python functions:
unique_rows = set(test_data)
Or better yet:
from collections import Counter
rows_and_counts = Counter(test_data)
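For the sample data, both approaches preserve the original tuples and their element types (set display order may vary):

>>> set(test_data)
{('A', 1, 2, 3), ('B', 4, 5, 6)}
>>> Counter(test_data)
Counter({('A', 1, 2, 3): 2, ('B', 4, 5, 6): 1})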
These solutions do not copy the data: they retain references to the original tuples just as they are. The numpy solution copies the data multiple times: once when converting to numpy, at least once when sorting, and possibly more when converting back.
These solutions have O(N) algorithmic complexity because they both use a hash table. The numpy unique solution uses sorting, and is therefore of O(N log N) complexity.
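If you want to see the difference on your own data, here is a rough timing sketch (illustrative only; the exact numbers depend on your machine and on N, and the test data below is made up for the comparison):

import timeit
import numpy as np
from collections import Counter

# hypothetical test data: 10000 mixed string/int rows
data = [('A', i % 100, i % 7, i % 3) for i in range(10000)]

# hash-based counting, O(N)
print(timeit.timeit(lambda: Counter(data), number=100))
# numpy conversion + sort-based unique, O(N log N) plus the copies
print(timeit.timeit(lambda: np.unique(np.array(data), return_counts=True, axis=0), number=100))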
Upvotes: 4
Reputation: 101
You could explicitly specify your dtype in the np.array call preceding np.unique:
test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]
test_data = np.array(test_data, dtype=[('letter', '<U1'),
                                       ('x', np.int32),
                                       ('y', np.int32),
                                       ('z', np.int32)])
summary, repeat = np.unique(test_data, return_counts=True, axis=0)
The summary then looks as follows:
array([('A', 1, 2, 3), ('B', 4, 5, 6)],
      dtype=[('letter', '<U1'), ('x', '<i4'), ('y', '<i4'), ('z', '<i4')])
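Since each field keeps its own dtype, the integer columns come back out as integers, for example:

>>> summary['letter']
array(['A', 'B'], dtype='<U1')
>>> summary['x']
array([1, 4], dtype=int32)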
Upvotes: 3
Reputation: 440
I think this has to do with the fact that in a numpy array all items have to have the same type. What you could do instead is parse your result back when it comes out of numpy, e.g.:
result = []
for row in summary.tolist():
    new_row = []
    for v in row:
        try:
            # values that were originally integers convert back cleanly
            new_row.append(int(v))
        except ValueError:
            # non-numeric strings are kept as they are
            new_row.append(v)
    result.append(tuple(new_row))
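For the summary from the question this recovers the original mixed types, and you can pair the rows back up with the counts if you need them:

>>> result
[('A', 1, 2, 3), ('B', 4, 5, 6)]
>>> list(zip(result, repeat))
[(('A', 1, 2, 3), 2), (('B', 4, 5, 6), 1)]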
Upvotes: 2