Zak

Reputation: 3253

Convert any dictionary to structured array

I'm trying to generate a structured numpy array which takes field names and variable types from a dictionary. I want it to cope with most contents a user could throw at it.

Currently it works like this:

>>> d = dict( a=0.456, b=1234.5687020, c=4, d=np.arange(3), text='text')
>>> dtype = [(str(key), val.__class__) for key, val in d.iteritems()]     
>>> arr = np.zeros( (5,), dtype=dtype)
>>> arr
array([(0.0, '', 0, 0.0, 0), (0.0, '', 0, 0.0, 0), (0.0, '', 0, 0.0, 0),
   (0.0, '', 0, 0.0, 0), (0.0, '', 0, 0.0, 0)], 
  dtype=[('a', '<f8'), ('text', 'S'), ('c', '<i8'), ('b', '<f8'), ('d', 'O')])

So far so good. But when I now try to assign the contents of the example dictionary to the first element, it's not all good:

>>> for key, val in d.iteritems():
...     arr[0][str(key)] = val

>>> arr[0]
(0.456, '', 4, 1234.5687020, [0, 1, 2])

The numbers and the array look okay, but the text is missing. Interestingly, manually assigning to the text field gives a different result:

>>> arr[0]['text'] = 'text'
>>> arr[0]['text'] 
't'

I find this very hard to understand...
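(For anyone puzzling over the same thing: the truncation follows from how numpy maps the bare string class to a dtype — it becomes a string type of length zero, so there is no room to store any characters. A minimal check, shown for Python 3 where `str` maps to a unicode dtype; under Python 2 the same happens with the bytes type `'S'`:)

```python
import numpy as np

# The bare str class maps to a zero-length string dtype.
print(np.dtype(str).itemsize)   # 0: no room for any characters

# Assigning into a zero-length string field silently truncates.
arr = np.zeros(1, dtype=[('text', str)])
arr[0]['text'] = 'text'
print(repr(arr[0]['text']))
```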

My method for determining the types seems to pick types that are too restrictive. I expected that some things, like initializing with float32 and then assigning float64, would result in data loss, but I would at least expect the array to be able to hold the example data.

Is there a more robust (possibly even more elegant?) way of determining dtype which allows strings to work properly?

What I'm looking for is a robust-ish way to determine the types of the dictionary contents. If I need to require that the text in the input dictionary defines the maximum string length, that is acceptable, but my function does not know beforehand which keys and types it will get.

Upvotes: 0

Views: 965

Answers (2)

Zak

Reputation: 3253

The best "automatic" solution I've managed to come up with is using the dtype of a numpy array made from each element, rather than the __class__ attribute:

>>> dtype = [(str(key), np.array([val]).dtype) for key, val in d.iteritems()]
>>> dtype
[('a', dtype('float64')), ('text', dtype('S4')), ('c', dtype('int64')), ('b', dtype('float64')), ('d', dtype('int64'))]

>>> arr = np.zeros( (5,), dtype=dtype)
>>> for key, val in d.iteritems():
...     arr[0][str(key)] = val
... 
>>> arr[0]
(0.456, 'text', 4, 1234.568702020934, 0)

This limits text inputs to the length of what's contained in the example data, and will fail if any input is a numpy array (as seen above -- it classifies the array as int because that's what its elements are).
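One way to also cover the ndarray case (an extension, not part of the original approach; `field_spec` is a hypothetical helper name) is to give array-valued entries a fixed-shape subarray field via the `(name, dtype, shape)` form, and keep the one-element-array trick for everything else:

```python
import numpy as np

d = dict(a=0.456, b=1234.5687020, c=4, d=np.arange(3), text='text')

def field_spec(name, val):
    # Arrays become fixed-shape subarray fields; scalars and strings
    # are inferred by wrapping the value in a one-element array.
    if isinstance(val, np.ndarray):
        return (name, val.dtype, val.shape)
    return (name, np.array([val]).dtype)

dtype = [field_spec(str(key), val) for key, val in d.items()]
arr = np.zeros(5, dtype=dtype)
for key, val in d.items():
    arr[0][str(key)] = val
```

With this, the 'd' field round-trips the array, though 'text' still gets a string type sized to the example value, so later inputs remain capped at that length.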

What I ended up doing is compiling a separate list of all elements which are strings, then adding them manually to the dtype as 'S128':

>>> stringkeys = [ str(key) for key, val in d.iteritems() if 'str' in str(val.__class__)]

>>> dtype = [(str(key), val.__class__) for key, val in d.iteritems() if not 'str' in str(val.__class__)] + [(key, 'S128') for key in stringkeys]

>>> dtype
[('a', <type 'float'>), ('c', <type 'int'>), ('b', <type 'float'>), ('d', <type 'numpy.ndarray'>), ('text', 'S128')]

A lot less elegant, and I suppose there are some other types which I may need to catch manually, but at least it works.

I had really hoped there would be an expression which automatically yields a type that will just work. And I still don't understand why the loop above does not assign the string variable at all, while the direct assignment does assign something...

Upvotes: 0

Mike Müller

Reputation: 85462

You need to provide a length for the type S:

dtype = [('a', float), ('b', float), ('c', int), ('d', np.ndarray), ('text', 'S10')]
arr = np.zeros( (5,), dtype=dtype)
for key, val in d.items():
    arr[0][str(key)] = val

Now:

>>> arr[0]
( 0.456,  1234.56870202, 4, array([0, 1, 2]), b'text')
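Note that the declared length is a hard cap: anything longer is silently truncated, so 'S10' (or the question's 'S128') must cover the longest string you expect. A quick sketch:

```python
import numpy as np

arr = np.zeros(1, dtype=[('text', 'S10')])
arr[0]['text'] = 'a string longer than ten bytes'
# Only the first 10 bytes survive the assignment.
print(arr[0]['text'])  # b'a string l'
```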

Upvotes: 1
