Reputation: 403
Say I have data like d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]
(basically JSON, where all entries have consistent data types).
In Pandas you can make this a table with df = pandas.DataFrame(d)
-- is there something comparable for plain NumPy record arrays? np.rec.fromrecords(d)
doesn't seem to give me what I want.
Upvotes: 14
Views: 11563
Reputation: 1210
My proposal (essentially a slightly improved version of hpaulj's answer):
dicts = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]
Creation of the dtype object:
import numpy as np

dt_tuples = []
for key, value in dicts[0].items():
    if not isinstance(value, str):
        value_dtype = np.array([value]).dtype
    else:
        value_dtype = '|S{}'.format(max(len(d[key]) for d in dicts))
    dt_tuples.append((key, value_dtype))
dt = np.dtype(dt_tuples)
As you can see, there's a problem with string handling - we need to check its maximum length to define the dtype. This additional condition can be skipped if you have no string values in your dicts, or if you're sure all those values have exactly the same length.
If you're looking for one-liner it would be something like this:
dt = np.dtype([(k, np.array([v]).dtype if not isinstance(v, str) else '|S{}'.format(max([len(d[k]) for d in dicts]))) for k, v in dicts[0].items()])
(still, it's probably better to break it up for readability).
Values list:
values = [tuple(d[name] for name in dt.names) for d in dicts]
Because we iterate over dt.names, we can be sure the order of the values is correct.
And, at the end, array creation:
a = np.array(values, dtype=dt)
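Put together, the steps above run as the following consolidated sketch (variable names as above; it assumes Python 3.7+, where dicts preserve insertion order, so the field order comes from the first dict):

```python
import numpy as np

dicts = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

# Infer a dtype from the first dict; string fields get a byte-string
# format sized to the longest value seen across all dicts.
dt_tuples = []
for key, value in dicts[0].items():
    if isinstance(value, str):
        value_dtype = '|S{}'.format(max(len(d[key]) for d in dicts))
    else:
        value_dtype = np.array([value]).dtype
    dt_tuples.append((key, value_dtype))
dt = np.dtype(dt_tuples)

# Rows must be tuples, ordered to match dt.names.
values = [tuple(d[name] for name in dt.names) for d in dicts]
a = np.array(values, dtype=dt)
```

Note that the `|S` format stores bytes; on Python 3 you'd get `b'cat'` back, and could use a `'U'` format instead if you want unicode strings.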
Upvotes: 2
Reputation: 231475
You could make an empty structured array of the right size and dtype, and then fill it from the list.
http://docs.scipy.org/doc/numpy/user/basics.rec.html
Structured arrays can be filled by field or row by row. ... If you fill it in row by row, it takes a tuple (but not a list or array!):
In [72]: dt=np.dtype([('weight',int),('animal','S10')])
In [73]: values = [tuple(each.values()) for each in d]
In [74]: values
Out[74]: [(5, 'cat'), (20, 'dog')]
The fields in the dt occur in the same order as in values.
In [75]: a=np.zeros((2,),dtype=dt)
In [76]: a[:]=[tuple(each.values()) for each in d]
In [77]: a
Out[77]:
array([(5, 'cat'), (20, 'dog')],
dtype=[('weight', '<i4'), ('animal', 'S10')])
With a bit more testing I found I can create the array directly from values.
In [83]: a = np.array(values, dtype=dt)
In [84]: a
Out[84]:
array([(5, 'cat'), (20, 'dog')],
dtype=[('weight', '<i4'), ('animal', 'S10')])
The dtype could be deduced from one (or more) of the dictionary items:
def gettype(v):
    if isinstance(v, int): return 'int'
    elif isinstance(v, float): return 'float'
    else:
        assert isinstance(v, str)
        return '|S%s' % (len(v)+10)
d0 = d[0]
names = list(d0.keys())
formats = [gettype(v) for v in d0.values()]
dt = np.dtype({'names':names, 'formats':formats})
producing:
dtype=[('weight', '<i4'), ('animal', 'S13')]
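The pieces of this answer can be assembled into a self-contained sketch (the explicit ordering by dt.names is my addition, so it doesn't depend on the dicts' key order matching the dtype):

```python
import numpy as np

d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

def gettype(v):
    # Map a sample value to a structured-array format string.
    if isinstance(v, int):
        return 'int'
    elif isinstance(v, float):
        return 'float'
    else:
        assert isinstance(v, str)
        return '|S%s' % (len(v) + 10)  # pad in case later strings are longer

d0 = d[0]
dt = np.dtype({'names': list(d0.keys()),
               'formats': [gettype(v) for v in d0.values()]})

# Build each row as a tuple in dt.names order, then create the array.
a = np.array([tuple(rec[name] for name in dt.names) for rec in d], dtype=dt)
```

The `+10` padding is a heuristic from the answer above: it leaves headroom for strings in later dicts that are longer than the one sampled.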
Upvotes: 8
Reputation: 4051
Well, you could make your life extra easy and just rely on Pandas, since numpy doesn't use column headers.
Pandas
df = pandas.DataFrame(d)
numpyMatrix = df.to_numpy() # spits out a numpy array (df.as_matrix() was removed in pandas 1.0)
Or you can ignore Pandas and use numpy plus a list comprehension to knock the dicts down to their values and store them as a matrix
Numpy
numpyMatrix = numpy.matrix([list(each.values()) for each in d])
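One caveat worth spelling out: on Python 3, dict.values() returns a view (hence the list(...) wrapper), and flattening mixed values into a plain array or matrix upcasts everything to a single common dtype - here unicode strings - so the numeric type is lost. A minimal sketch of that behavior:

```python
import numpy as np

d = [dict(animal='cat', weight=5), dict(animal='dog', weight=20)]

# Mixing 'cat'/'dog' with integers forces a common string dtype,
# so the weights come back as strings, not numbers.
m = np.array([list(each.values()) for each in d])
```

If you need the per-column types preserved, the structured-array answers above are the way to go.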
Upvotes: 5