Roger

Reputation: 403

Easiest way to create a NumPy record array from a list of dictionaries?

Say I have data like d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)] (basically JSON, where all entries have consistent data types).

In Pandas you can make this a table with df = pandas.DataFrame(d) -- is there something comparable for plain NumPy record arrays? np.rec.fromrecords(d) doesn't seem to give me what I want.

Upvotes: 14

Views: 11563

Answers (3)

Zuku

Reputation: 1210

My proposal (essentially a slightly improved version of hpaulj's answer):

dicts = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

Creating the dtype object:

import numpy as np

dt_tuples = []
for key, value in dicts[0].items():
    if not isinstance(value, str):
        value_dtype = np.array([value]).dtype
    else:
        value_dtype = '|S{}'.format(max(len(d[key]) for d in dicts))
    dt_tuples.append((key, value_dtype))
dt = np.dtype(dt_tuples)

As you can see, there's a problem with string handling - we need to check the maximum length across all dicts to define the dtype. This additional condition can be skipped if you have no string values in your dicts, or if you're sure that all string values for a given key have exactly the same length.
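To see why the string branch matters, here is a minimal sketch (assuming import numpy as np and the dicts list from above) showing the two inference paths:

```python
import numpy as np

dicts = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

# Numeric values: wrap a sample in an array and let NumPy infer the dtype.
numeric_dtype = np.array([dicts[0]['weight']]).dtype  # e.g. int64 on most platforms

# Strings: inferring from a single sample would truncate longer values later,
# so the maximum length across all dicts is used instead.
max_len = max(len(d['animal']) for d in dicts)
string_dtype = '|S{}'.format(max_len)  # '|S3' for 'cat'/'dog'
print(numeric_dtype, string_dtype)
```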

If you're looking for one-liner it would be something like this:

dt = np.dtype([(k, np.array([v]).dtype if not isinstance(v, str) else '|S{}'.format(max([len(d[k]) for d in dicts]))) for k, v in dicts[0].items()])

(still, it's probably better to break it up for readability).

Values list:

values = [tuple(d[name] for name in dt.names) for d in dicts]

Because we iterate over dt.names, we can be sure the order of the values matches the order of the fields.

And, at the end, array creation:

a = np.array(values, dtype=dt)
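Putting the pieces together, the whole recipe runs end to end; a minimal self-contained sketch (assuming import numpy as np):

```python
import numpy as np

dicts = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

# Build the dtype from the first dict, using max string length across all dicts.
dt = np.dtype([(k,
                np.array([v]).dtype if not isinstance(v, str)
                else '|S{}'.format(max(len(d[k]) for d in dicts)))
               for k, v in dicts[0].items()])

# Extract values in dt.names order so rows always line up with fields.
values = [tuple(d[name] for name in dt.names) for d in dicts]
a = np.array(values, dtype=dt)

print(a['animal'])  # fields are addressable by name, like DataFrame columns
print(a['weight'])
```

Note that under Python 3 the strings come back as bytes (b'cat') because of the |S dtype; use a 'U' dtype instead if you want unicode fields.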

Upvotes: 2

hpaulj

Reputation: 231475

You could make an empty structured array of the right size and dtype, and then fill it from the list.

http://docs.scipy.org/doc/numpy/user/basics.rec.html

Structured arrays can be filled by field or row by row. ... If you fill it in row by row, it takes a tuple (but not a list or array!):

In [72]: dt = np.dtype([('weight', int), ('animal', 'S10')])

In [73]: values = [tuple(each.values()) for each in d]

In [74]: values
Out[74]: [(5, 'cat'), (20, 'dog')]

The fields in dt occur in the same order as the values in each tuple.

In [75]: a=np.zeros((2,),dtype=dt)

In [76]: a[:]=[tuple(each.values()) for each in d]

In [77]: a
Out[77]: 
array([(5, 'cat'), (20, 'dog')], 
      dtype=[('weight', '<i4'), ('animal', 'S10')])

With a bit more testing I found I can create the array directly from values.

In [83]: a = np.array(values, dtype=dt)

In [84]: a
Out[84]: 
array([(5, 'cat'), (20, 'dog')], 
      dtype=[('weight', '<i4'), ('animal', 'S10')])

The dtype could be deduced from one (or more) of the dictionary items:

def gettype(v):
    if isinstance(v,int): return 'int'
    elif isinstance(v,float): return 'float'
    else:
        assert isinstance(v,str)
        return '|S%s'%(len(v)+10)
d0 = d[0]
names = list(d0.keys())   # list() needed for Python 3, where keys() is a view
formats = [gettype(v) for v in d0.values()]
dt = np.dtype({'names':names, 'formats':formats})
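Applying this deduction to the question's data, a self-contained sketch (the +10 padding for strings is the answer's own safety margin for longer values in later rows):

```python
import numpy as np

d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

def gettype(v):
    # Map a sample Python value to a NumPy format string.
    if isinstance(v, int):
        return 'int'
    elif isinstance(v, float):
        return 'float'
    else:
        assert isinstance(v, str)
        return '|S%s' % (len(v) + 10)  # pad so longer strings still fit

d0 = d[0]
dt = np.dtype({'names': list(d0.keys()),
               'formats': [gettype(v) for v in d0.values()]})
a = np.array([tuple(each.values()) for each in d], dtype=dt)
print(a.dtype)  # e.g. [('animal', 'S13'), ('weight', '<i8')] on a 64-bit Linux
```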

producing:

dtype=[('weight', '<i4'), ('animal', 'S13')]

Upvotes: 8

ZJS

Reputation: 4051

Well, you could make your life extra easy and just rely on Pandas, since NumPy doesn't use column headers

Pandas

df = pandas.DataFrame(d)
numpyMatrix = df.to_numpy()  # spits out a NumPy array; as_matrix() was removed in pandas 1.0

Or you can ignore Pandas and use numpy + list comprehension to knock down the dicts to values and store as matrix

Numpy

numpyMatrix = numpy.matrix([list(each.values()) for each in d])  # list() needed for Python 3
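A caveat worth knowing: a plain matrix/array has a single dtype, so mixing ints and strings upcasts everything to a common string dtype. A minimal sketch (using np.array, since np.matrix is discouraged in current NumPy docs):

```python
import numpy as np

d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

# list() is needed in Python 3, where dict.values() returns a view.
m = np.array([list(each.values()) for each in d])
print(m)
# The int weights become the strings '5' and '20', because a homogeneous
# array upcasts mixed int/str data to one string dtype - which is exactly
# why a structured array (per-field dtypes) is usually the better fit here.
```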

Upvotes: 5
