mbudge

Reputation: 557

Defining columns in numpy array - TypeError: invalid type promotion

I am defining an array which should look like this

['word1', 2000, 21]
['word2', 2002, 33]
['word3', 1988, 51]
['word4', 1999, 26]
['word5', 2001, 72]

However, when I append a new entry I get a TypeError.

import numpy as np

npdtype = [('word', 'S35'), ('year', int), ('wordcount', int)]
np_array = np.empty((0,3), dtype=npdtype)

word = 'word1'
year = '2001'
word_count = '21'

np_array = np.append(np_array, [['word1', int(year), int(word_count)]], axis=0)

Traceback

 File "/home/matt/.local/lib/python2.7/site-packages/numpy/lib/function_base.py", line 4586, in append
return concatenate((arr, values), axis=axis)
 TypeError: invalid type promotion

What am I doing wrong?

Thanks

Upvotes: 1

Views: 2419

Answers (2)

hpaulj

Reputation: 231385

np.append is just a way of calling np.concatenate; look at its code. Before concatenating, it has to make sure the 2nd argument is an array, and it does that without any knowledge of your special dtype. Try that conversion yourself: it probably produces a string dtype. Only then does it try the concatenate, which fails because the string array cannot be promoted to your structured dtype. So you need to make an array with the correct dtype first.

I discourage the use of append; it's better to use concatenate directly, so that you have to understand all the details.
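As a rough illustration of the conversion step described above, the following sketch mimics what happens to the second argument when no dtype is supplied, and contrasts it with building the record explicitly. The variable names `values` and `record` are my own, and the exact string width NumPy picks may differ between versions.

```python
import numpy as np

npdtype = [('word', 'S35'), ('year', int), ('wordcount', int)]

# Plain conversion with no dtype hint, roughly what np.append does internally:
# the mixed list collapses to a single string dtype with shape (1, 3).
values = np.asarray([['word1', 2001, 21]])
print(values.dtype.kind)   # a string kind, not a structured dtype
print(values.shape)

# Supplying a tuple plus the structured dtype gives one record with 3 fields.
record = np.array([(b'word1', 2001, 21)], dtype=npdtype)
print(record.shape)
print(record['year'])
```

The first array is 2d with three string elements, while the second is 1d with a single three-field record, which is why the two cannot simply be joined.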

======================

Expanding on your answer:

In [75]: npdtype
Out[75]: [('word', 'S35'), ('year', numpy.int16), ('wordcount', numpy.int16)]
In [76]: column = np.array( [b'word1', np.int16(year), np.int16(word_count)], dtype=npdtype)
In [77]: column
Out[77]: 
array([(b'word1', 0, 0), 
       (b'\xd1\x07', 0, 0), 
       (b'\x15', 0, 0)], 
      dtype=[('word', 'S35'), ('year', '<i2'), ('wordcount', '<i2')])

I don't think this is what you want.

The correct way to provide data for a structured array record is with a tuple, or a list of tuples (note the extra ()):

In [78]: column = np.array( [(b'word1', np.int16(year), np.int16(word_count))], dtype=npdtype)
In [79]: column
Out[79]: 
array([(b'word1', 2001, 21)], 
      dtype=[('word', 'S35'), ('year', '<i2'), ('wordcount', '<i2')])
In [80]: column.shape
Out[80]: (1,)

Now I have a 1d, 1 element array with 3 fields.

Without the [], I get a single-element 0d array:

In [81]: column0 = np.array( (b'word1', np.int16(year), np.int16(word_count)), dtype=npdtype)
In [82]: column0.shape
Out[82]: ()
In [83]: column0
Out[83]: 
array((b'word1', 2001, 21), 
      dtype=[('word', 'S35'), ('year', '<i2'), ('wordcount', '<i2')])

I can concatenate several of the 1d arrays:

In [85]: np.concatenate([column,column,column])
Out[85]: 
array([(b'word1', 2001, 21), 
       (b'word1', 2001, 21), 
       (b'word1', 2001, 21)], 
      dtype=[('word', 'S35'), ('year', '<i2'), ('wordcount', '<i2')])
In [86]: _.shape
Out[86]: (3,)
In [87]: __['year']   # access the 2nd field (not column)
Out[87]: array([2001, 2001, 2001], dtype=int16)

Regarding the need for b: you are using Py3 (as I am), where unicode is the default string type. So if you had used U35 in npdtype, you could have left off the b (bytestring prefix).
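A minimal sketch of that U35 variant, assuming Python 3; the name `udtype` is only for illustration and is not in the original code:

```python
import numpy as np

# Same layout as before, but with a unicode field ('U35') instead of bytes ('S35').
udtype = [('word', 'U35'), ('year', np.int16), ('wordcount', np.int16)]

# Ordinary Python 3 strings can be used directly, with no b prefix.
row = np.array([('word1', 2001, 21)], dtype=udtype)
print(row['word'][0])
print(row.shape)
```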

That initial array of shape (0,3) is probably not what you want. It has 0 rows and 3 columns, and each element still has 3 dtype fields. Look at a (1,3) version:

In [88]: np.empty((1,3),dtype=npdtype)
Out[88]: 
array([[(b'', 0, 0), (b'', 0, 0), (b'', 0, 0)]], 
      dtype=[('word', 'S35'), ('year', '<i2'), ('wordcount', '<i2')])

This has blanks and 0 because of what happens to be in the memory. They could have been random characters/numbers.
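If an empty starting array is really wanted, one possible alternative (a sketch, not necessarily what the question intended) is a 1d array of shape (0,), so that the three values live in the dtype fields rather than in a second dimension. The names `start` and `row` are illustrative.

```python
import numpy as np

npdtype = [('word', 'S35'), ('year', np.int16), ('wordcount', np.int16)]

# 1d and empty: zero records, each of which would have 3 fields.
start = np.empty(0, dtype=npdtype)

# A single record, built from a tuple inside a list.
row = np.array([(b'word1', 2001, 21)], dtype=npdtype)

# Both arrays are 1d with the same dtype, so concatenate works.
result = np.concatenate((start, row))
print(result.shape)
print(result['year'])
```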

numpy lets you make arrays with one or more 0 dimensions, but they usually aren't useful. About the only place they appear is as the starting point for an iterative array definition, e.g.

 arr = np.empty((0,3))
 for i in range(10):
     arr = np.append(arr, [[i,i+1,i+2]], axis=0)

which is better written as

 ll = []
 for i in range(10):
     ll.append([i,i+1,i+2])
 arr = np.array(ll)

or

 arr = np.empty((10,3))
 for i in range(10):
     arr[i,:]=[i,i+1,i+2]

Repeated array concatenation is slower.
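Applied to the structured dtype from the question, the list-accumulation pattern might look like the sketch below. The `raw_rows` data is hypothetical, made up to resemble the question's values arriving as strings.

```python
import numpy as np

npdtype = [('word', 'S35'), ('year', int), ('wordcount', int)]

# Hypothetical input rows; the numbers arrive as strings, as in the question.
raw_rows = [('word1', '2000', '21'), ('word2', '2002', '33'), ('word3', '1988', '51')]

records = []
for word, year, word_count in raw_rows:
    # One tuple per record; encode the word to bytes to match the 'S35' field.
    records.append((word.encode('ascii'), int(year), int(word_count)))

# Build the structured array once, after the loop.
np_array = np.array(records, dtype=npdtype)
print(np_array.shape)
print(np_array['year'])
```

The array is created a single time from the finished list, so no intermediate arrays are copied on each iteration.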

Upvotes: 2

Bill Bell

Reputation: 21643

Follow @hpaulj's advice and then tidy up.

import numpy as np

npdtype = [('word', 'S35'), ('year', np.int16), ('wordcount', np.int16)]
np_array = np.empty((0,3), dtype=npdtype)

word = 'word1'
year = '2001'
word_count = '21'

column = np.array( [b'word1', np.int16(year), np.int16(word_count)], dtype=npdtype)
print (column.shape)
column.shape=-1,3
print (column.shape)
print (column)
result=np.concatenate((np_array,column),axis=0)
print (result)

#~ np_array = np.append(np_array, [['word1', int(year), int(word_count)]], axis=0)

The two things that I found:

  • Meticulous matching of the types of record items is required, hence the use of numpy types in the definition of npdtype and conversions of strings; and also use of b prefix to the first element of the record.
  • The created column has a curious shape, thus the need to reshape it.

Here's the output.

>pythonw -u "temp.py"
(3,)
(1, 3)
[[(b'word1', 0, 0) (b'\xd1\x07', 0, 0) (b'\x15', 0, 0)]]
[[(b'word1', 0, 0) (b'\xd1\x07', 0, 0) (b'\x15', 0, 0)]]
>Exit code: 0

Upvotes: 0
