caren vanderlee

Reputation: 195

How can I change NumPy array elements from string to int or float?

I have a data set stored in a NumPy array as shown below, but all the data in it is stored as strings. How can I change the strings to int or float and store the values back?

  data = numpy.array([]) # <--- array initialized with numpy.array

The data variable holds the following information:

 [['1' '0' '3' ..., '7.25' '' 'S']
  ['2' '1' '1' ..., '71.2833' 'C85' 'C']
   ['3' '1' '3' ..., '7.925' '' 'S']
   ..., 
   ['889' '0' '3' ..., '23.45' '' 'S']
   ['890' '1' '1' ..., '30' 'C148' 'C']
   ['891' '0' '3' ..., '7.75' '' 'Q']]

I want to change the first column to int and store the values back. To do so, I did:

 data[0::,0] = data[0::,0].astype(int)

but it didn't change anything.

Upvotes: 2

Views: 10258

Answers (3)

hpaulj

Reputation: 231738

I can make an array that contains strings by starting with lists of strings; note the S4 dtype:

In [690]: data=np.array([['1','0','7.23','two'],['2','3','1.32','four']])

In [691]: data
Out[691]: 
array([['1', '0', '7.23', 'two'],
       ['2', '3', '1.32', 'four']], 
      dtype='|S4')

It's more likely that such an array is created by reading a csv file.

I can also view it as an array of single-byte strings - the shape and dtype have changed, but the data buffer is the same (the same 32 bytes):

In [692]: data.view('S1')
Out[692]: 
array([['1', '', '', '', '0', '', '', '', '7', '.', '2', '3', 't', 'w',
        'o', ''],
       ['2', '', '', '', '3', '', '', '', '1', '.', '3', '2', 'f', 'o',
        'u', 'r']], 
      dtype='|S1')

In fact, I can change an individual byte, turning the 'two' of the original array into 'twos':

In [693]: data.view('S1')[0,-1]='s'

In [694]: data
Out[694]: 
array([['1', '0', '7.23', 'twos'],
       ['2', '3', '1.32', 'four']], 
      dtype='|S4')

But if I try to change an element of data to an integer, it is converted to a string to match the S4 dtype:

In [695]: data[1,0]=4

In [696]: data
Out[696]: 
array([['1', '0', '7.23', 'twos'],
       ['4', '3', '1.32', 'four']], 
      dtype='|S4')

The same would happen if the number came from int(data[1,0]) or some variation on that.
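To make that concrete, here is a minimal sketch of the same point (note that under Python 3 / current NumPy the array of strings gets a unicode dtype like '<U4' rather than 'S4', but the behaviour is the same):

```python
import numpy as np

# Any scalar assigned into a string-typed array is converted
# to a string to match the array's dtype.
data = np.array([['1', '0', '7.23', 'two'],
                 ['2', '3', '1.32', 'four']])
data[1, 0] = int(data[1, 0]) + 2   # the arithmetic happens as ints...
print(data[1, 0])                  # ...but the result is stored back as a string
print(data.dtype)                  # still a string/unicode dtype
```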

But I can trick it into seeing the integer as a string of bytes (represented as \x04)

In [704]: data[1,0]=np.array(4).view('S4')

In [705]: data
Out[705]: 
array([['1', '0', '7.23', 'twos'],
       ['\x04', '3', '1.32', 'four']], 
      dtype='|S4')

Arrays can share data buffers. The data attribute is a pointer to a block of memory. It's the array's dtype that controls how that block is interpreted. For example, I can make another array of ints and redirect its data attribute:

In [714]: d2=np.zeros((2,4),dtype=int)

In [715]: d2
Out[715]: 
array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

In [716]: d2.data=data.data  # change the data pointer

In [717]: d2
Out[717]: 
array([[        49,         48,  858926647, 1936684916],
       [         4,         51,  842214961, 1920298854]])

Now d2[1,0] is the integer 4. But the other items are not recognizable, because they are strings viewed as integers. That's not the same as passing them through the int() function.

I don't recommend changing the data pointer like this as a regular practice. It would be easy to mess things up. I had to take care to ensure that d2.nbytes was 32, the same as for data.

Because the buffer is shared, a change to d2 also appears in data (but displayed according to a different dtype):

In [718]: d2[0,0]=3

In [719]: data
Out[719]: 
array([['\x03', '0', '7.23', 'twos'],
       ['\x04', '3', '1.32', 'four']], 
      dtype='|S4')

A view with a complex dtype does something similar:

In [723]: data.view('i4,i4,f,|S4')
Out[723]: 
array([[(3, 48, 4.148588672592268e-08, 'twos')],
       [(4, 51, 1.042967401332362e-08, 'four')]], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4'), ('f3', 'S4')])

Notice the 48 and 51 that also appear in d2. The next float column is unrecognizable.

That gives an idea of what can and cannot be done 'in-place'.

But to get an array that contains numbers and strings in a meaningful way, it is better to construct a new structured array. Perhaps the cleanest way to do that is with an intermediate list of tuples.

In [759]: dl=[tuple(i) for i in data.tolist()]

In [760]: dl
Out[760]: [('1', '0', '7.23', 'two'), ('2', '3', '1.32', 'four')]

In [761]: np.array(dl,dtype='i4,i4,f,|S4')
Out[761]: 
array([(1, 0, 7.230000019073486, 'two'), (2, 3, 1.3200000524520874, 'four')], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4'), ('f3', 'S4')])

All these fields take up 4 bytes, so nbytes is the same. But the individual values have passed through converters. I have given np.array the freedom to convert values as is consistent with the input and the new dtype. That's a lot easier than trying to perform some sort of convoluted in-place conversion.

A list of tuples with a mix of numbers and strings would also have worked:

[(1, 0, 7.23, 'two'), (2, 3, 1.32, 'four')]

Structured arrays are displayed as lists of tuples, and in the structured-array docs values are always input as lists of tuples.

recarray can also be used, but essentially it is just an array subclass that lets you access fields as attributes.
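A short sketch of that attribute access, using the same made-up data and the auto-generated field names f0..f3:

```python
import numpy as np

# Build a structured array from a list of tuples, then view it
# as a recarray so fields are reachable as attributes.
dl = [(1, 0, 7.23, 'two'), (2, 3, 1.32, 'four')]
ra = np.array(dl, dtype='i4,i4,f4,S4').view(np.recarray)
print(ra.f0)   # integer column, same data as ra['f0']
print(ra.f3)   # byte-string column
```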

If the original array was generated from a csv file, it would have been better to use np.genfromtxt (or np.loadtxt) with appropriate options. It can perform the per-field conversions and return a structured array directly.
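For example, a minimal sketch with genfromtxt reading csv text (the csv content here is made up for illustration):

```python
import numpy as np
from io import StringIO

# genfromtxt converts each column to its field's dtype and
# returns a structured array, one tuple per row.
csv = StringIO("1,0,7.25,S\n2,1,71.2833,C\n")
arr = np.genfromtxt(csv, delimiter=',', dtype='i4,i4,f8,S1')
print(arr)
print(arr['f2'])  # the float column, already converted
```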

Upvotes: 1

das-g

Reputation: 10004

NumPy arrays have an associated type for their elements. Assigning to a slice of a NumPy array will cast the new data to that type. If that's not possible, the assignment fails with an exception:

import numpy
a = numpy.array([[1, 2], [3, 4]])
print(a)
# [[1 2]
#  [3 4]]
print(a.dtype)
# int64

a[0, 0] = 'look, a string'
# raises ValueError: invalid literal for int() with base 10

In your case, data[0::,0].astype(int) will produce a NumPy array with associated member type int64, but assigning back into a slice of the original array will convert them back to strings.
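A minimal sketch of why the assignment appears to do nothing: astype(int) does create an integer array, but writing it back into the string-typed parent converts it to strings again.

```python
import numpy as np

data = np.array([['1', '7.25'], ['2', '71.2833']])
col = data[:, 0].astype(int)
print(col.dtype)        # an integer dtype
data[:, 0] = col        # cast back to the array's string dtype
print(data.dtype.kind)  # still a string/unicode array
```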

Unlike standard NumPy arrays, the NumPy record arrays mentioned in Padraic's comment allow different types for different columns.

I don't know if a standard NumPy array can be converted to a NumPy record array in-place, so constructing one as suggested in enrico's answer with

data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

might be the best option. If that's not possible, you can construct one from your standard NumPy array and overwrite the variable with the result:

import numpy
data = numpy.array([['1', '0', '3', '7.25', '', 'S'],
                    ['2', '1', '1', '71.2833', 'C85', 'C'],
                    ['3', '1', '3', '7.925', '', 'S'],
                    ['889', '0', '3', '23.45', '', 'S'],
                    ['890', '1', '1', '30', 'C148', 'C'],
                    ['891', '0', '3', '7.75', '', 'Q']])
print(repr(data))
# array([['1', '0', '3', '7.25', '', 'S'],
#        ['2', '1', '1', '71.2833', 'C85', 'C'],
#        ['3', '1', '3', '7.925', '', 'S'],
#        ['889', '0', '3', '23.45', '', 'S'],
#        ['890', '1', '1', '30', 'C148', 'C'],
#        ['891', '0', '3', '7.75', '', 'Q']], 
#       dtype='|S7')

data = numpy.core.records.fromarrays(data.T, dtype='i4,S4,S4,S4,S4,S4')
print(repr(data))
# rec.array([(1, '0', '3', '7.25', '', 'S'), (2, '1', '1', '71.2', 'C85', 'C'),
#        (3, '1', '3', '7.92', '', 'S'), (889, '0', '3', '23.4', '', 'S'),
#        (890, '1', '1', '30', 'C148', 'C'), (891, '0', '3', '7.75', '', 'Q')], 
#       dtype=[('f0', '<i4'), ('f1', '|S4'), ('f2', '|S4'), ('f3', '|S4'), ('f4', '|S4'), ('f5', '|S4')])

Upvotes: 0

enrico.bacis

Reputation: 31534

You could set the data type (dtype) at array initialization. For example, if your rows are composed of one 32-bit integer and one 4-byte string, you could specify the dtype 'i4, S4'.

data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')

You could read more about dtypes here.
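A short sketch of what the structured dtype buys you: each field keeps its own type (field names f0, f1 are the auto-generated defaults), so the integer column supports numeric operations directly.

```python
import numpy as np

data = np.array([(1, 'a'), (2, 'b')], dtype='i4, S4')
print(data['f0'] + 10)   # integer arithmetic on the first field
print(data['f1'])        # byte strings in the second field
```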

Upvotes: 3
