user1581390

Reputation: 1998

Pandas: What is dtype = <U64, and How Do I Convert it to String?

I have a table, and one column is loaded from a CSV as np.str. But the dtype shows this weird <U64 (I guess it means unsigned 64-bit int?), and converting with astype doesn't work.

stringIDs = extractedBatch.ID.astype(np.str)

After astype, the dtype says 'object'.

Upvotes: 4

Views: 17978

Answers (2)

hpaulj

Reputation: 231335

In [313]: arr = np.array(['one','twenty'])
In [314]: arr
Out[314]: array(['one', 'twenty'], dtype='<U6')
In [315]: arr.astype(object)
Out[315]: array(['one', 'twenty'], dtype=object)
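To read the dtype string itself: the '<' is the byte-order marker, 'U' is NumPy's kind code for fixed-width Unicode strings (not unsigned int), and the number is the maximum length in characters, not bits. A small sketch of how to check that, with variable names of my own choosing rather than anything from the session above:

```python
import numpy as np

arr = np.array(['one', 'twenty'])

# 'U' is the NumPy kind code for fixed-width Unicode strings, not unsigned ints.
kind = arr.dtype.kind

# Each character is stored in 4 bytes, so itemsize is 4 * the maximum length.
max_chars = arr.dtype.itemsize // 4

# Individual elements come back as np.str_, which subclasses Python's str.
is_python_str = isinstance(arr[0], str)

# astype(object) keeps the same values but stores plain Python str objects.
as_objects = arr.astype(object)

print(kind, max_chars, is_python_str, as_objects.dtype)
```

So a '<U64' column already holds strings; it just caps them at 64 characters.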

np.char applies string methods to elements of a string dtype array:

In [316]: np.char.add(arr, ' foo')
Out[316]: array(['one foo', 'twenty foo'], dtype='<U10')

In the NumPy version used here, the add ufunc is not defined for string dtypes:

In [317]: np.add(arr, ' foo')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-317-eff87c160b77> in <module>
----> 1 np.add(arr, ' foo')

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U6') dtype('<U6') dtype('<U6')

This use of np.add turns the ' foo' string into an array before trying to use it. It then tries to add a 'U6' string to a 'U6' string, and there is no ufunc loop for that combination.

When applied to an object dtype array, np.add delegates the action to the corresponding method of the elements. Since addition is defined for Python strings, it works:

In [318]: np.add(arr.astype(object), ' foo')
Out[318]: array(['one foo', 'twenty foo'], dtype=object)

This pattern applies to all NumPy ufuncs. They are defined for specific dtypes. Given an object dtype array, they delegate to the elements' own methods, which may or may not work depending on what those elements define.
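A minimal sketch of that delegation, using my own variable names (not from the session above). It relies on str defining + and * but not unary minus:

```python
import numpy as np

words = np.array(['one', 'twenty']).astype(object)

# str defines __add__ and __mul__, so these ufuncs can delegate to each element.
added = np.add(words, ' foo')
repeated = np.multiply(words, 2)

# str has no unary minus, so the same delegation fails with a TypeError.
try:
    np.negative(words)
    negate_failed = False
except TypeError:
    negate_failed = True

print(added, repeated, negate_failed)
```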

Both the object and np.char approaches do the equivalent of a list comprehension, and at about the same speed:

In [324]: [i+' foo' for i in arr]
Out[324]: ['one foo', 'twenty foo']
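If you want to check the speed claim on your own machine, something like the following rough sketch should do. The array size, repeat count, and function names are arbitrary choices of mine, and I'm not quoting timings since they vary by NumPy version and hardware:

```python
import timeit
import numpy as np

arr = np.array(['one', 'twenty'] * 1000)
obj = arr.astype(object)

def with_char():
    # np.char applies the string method element by element.
    return np.char.add(arr, ' foo')

def with_object():
    # Object dtype delegates + to each Python str.
    return obj + ' foo'

def with_list():
    # Plain Python loop over the array elements.
    return [s + ' foo' for s in arr]

# All three approaches produce the same strings.
same = list(with_char()) == list(with_object()) == with_list()
print('same results:', same)

for fn in (with_char, with_object, with_list):
    print(fn.__name__, timeit.timeit(fn, number=20))
```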

===

Example with string replication (*):

In [319]: arr*2
TypeError: ufunc 'multiply' did not contain a loop with signature matching types dtype('<U6') dtype('<U6') dtype('<U6')

In [320]: arr.astype(object)*2
Out[320]: array(['oneone', 'twentytwenty'], dtype=object)

In [322]: np.char.multiply(arr,2)
Out[322]: array(['oneone', 'twentytwenty'], dtype='<U12')

Upvotes: 3

Andy Hayden

Reputation: 375395

Pandas doesn't keep a NumPy string dtype such as <U64; it uses object (even though the underlying values are str):

In [11]: s = pd.Series(['a'], dtype='U64')

In [12]: type(s[0])
Out[12]: str
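A slightly longer sketch along the same lines, with variable names of my own. The astype('string') step is an addition on my part and assumes pandas 1.0 or later, where the dedicated string extension dtype exists. The dtypes are printed rather than stated because what pandas reports may differ between versions:

```python
import pandas as pd

s = pd.Series(['a', 'bc'], dtype='U64')

# The requested NumPy string dtype is not kept; print what pandas chose instead.
print(s.dtype)

# The stored elements are ordinary Python str objects either way.
element_is_str = isinstance(s.iloc[0], str)

# Opt in to the dedicated string dtype (assumes pandas >= 1.0).
converted = s.astype('string')
print(converted.dtype)
```

So there is usually nothing to convert: the values are already strings, and 'object' is just how pandas labels the column.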

Upvotes: 0
