Kenenbek Arzymatov

Reputation: 9119

How to correctly define numpy dtype

I have this piece of code, which tries to load four columns from a CSV file:

import numpy as np
rtype = np.dtype([('1', np.float), ('2', np.float), ('3', np.float), ('tier', np.str, 32)])
x1, x2, x3, x4 = np.genfromtxt("../Data/out.txt", dtype=rtype, skip_header=1, delimiter=",", usecols=(3, 4, 5, 6), unpack=True)

But I have an error:

ValueError: too many values to unpack (expected 4)

This is a little strange, because I have four variables and load four columns.

How do I load them correctly? I think the problem is in np.dtype, because without it everything works fine (though with other types). I use Python 3.

Upvotes: 1

Views: 493

Answers (1)

hpaulj

Reputation: 231385

It looks like you have text like this:

In [447]: txt=b"""1.2 3.3 2.0 str
     ...: 3.3 3.3 2.2 astring
     ...: """

My first choice is genfromtxt with dtype=None (automatic dtype determination):

In [448]: np.genfromtxt(txt.splitlines(),dtype=None)
Out[448]: 
array([(1.2, 3.3, 2.0, b'str'), (3.3, 3.3, 2.2, b'astring')], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S7')])

Without dtype it tries to make everything float - including the string column:

In [449]: np.genfromtxt(txt.splitlines())
Out[449]: 
array([[ 1.2,  3.3,  2. ,  nan],
       [ 3.3,  3.3,  2.2,  nan]])

I don't use unpack much, preferring to get one 2d or structured array. But with unpack:

In [450]: x1,x2,x3,x4=np.genfromtxt(txt.splitlines(),unpack=True)
In [451]: x1
Out[451]: array([ 1.2,  3.3])
In [452]: x4
Out[452]: array([ nan,  nan])

I still get the nan for the string column.

Borrowing the dtype from the dtype=None case:

In [456]: dt=np.dtype([('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S7')])
In [457]: dt
Out[457]: dtype([('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S7')])
In [458]: np.genfromtxt(txt.splitlines(),unpack=True,dtype=dt)
Out[458]: 
array([(1.2, 3.3, 2.0, b'str'), (3.3, 3.3, 2.2, b'astring')], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S7')])
In [459]: _.shape
Out[459]: (2,)

With this compound dtype, unpack gives me one item per row of the text, not one item per column. In other words, unpack does not split up the structured fields.
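To see where the question's ValueError comes from, here is a minimal sketch. The five-row sample text and the names `text`, `dt`, and `message` are made up for illustration. A structured array is 1-D with one element per row, so tuple-unpacking it iterates over rows rather than fields:

```python
import io
import numpy as np

# Hypothetical sample: five rows of three floats and one string each.
text = "\n".join(f"{i}.0 {i}.5 {i}.25 name{i}" for i in range(5))

dt = np.dtype([("f0", "f8"), ("f1", "f8"), ("f2", "f8"), ("f3", "U7")])
data = np.genfromtxt(io.StringIO(text), dtype=dt, encoding="utf-8")

print(data.shape)  # (5,): one record per row, no column axis

message = None
try:
    # Iterating a 1-D structured array yields 5 records, not 4 columns.
    x1, x2, x3, x4 = data
except ValueError as err:
    message = str(err)
    print(message)
```

With exactly four rows the unpacking would succeed silently, but each variable would hold a whole record instead of a column.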

One way to handle the string column and still use unpack is to read the text twice:

First, load the float columns:

In [462]: x1,x2,x3=np.genfromtxt(txt.splitlines(),unpack=True,usecols=[0,1,2])
In [463]: x3
Out[463]: array([ 2. ,  2.2])

Then load the string column, with dtype=None or S32:

In [466]: x4=np.genfromtxt(txt.splitlines(),unpack=True,usecols=[3],dtype=None)
In [467]: x4
Out[467]: 
array([b'str', b'astring'], 
      dtype='|S7')

Another option is to load the structured array and unpack the fields individually:

In [468]: data = np.genfromtxt(txt.splitlines(),dtype=None)
In [469]: data.dtype
Out[469]: dtype([('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', 'S7')])
In [470]: x1, x2, x3 = data['f0'],data['f1'],data['f2']
In [471]: x4 = data['f3']
In [472]: x4
Out[472]: 
array([b'str', b'astring'], 
      dtype='|S7')
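Applied to the layout described in the question (comma delimiter, one header line, columns 3 to 6), the same idea might look like the sketch below. The file contents and the field names `x`, `y`, `z` are assumptions, since the question does not show the data. An in-memory `io.StringIO` stands in for "../Data/out.txt":

```python
import io
import numpy as np

# Hypothetical stand-in for the question's file: a header plus 7 columns.
csv_text = """a,b,c,x,y,z,tier
0,0,0,1.2,3.3,2.0,str
0,0,0,3.3,3.3,2.2,astring
"""

# Three float fields and one 32-character unicode string field.
rtype = np.dtype([("x", "f8"), ("y", "f8"), ("z", "f8"), ("tier", "U32")])

data = np.genfromtxt(
    io.StringIO(csv_text),
    dtype=rtype,
    delimiter=",",
    skip_header=1,
    usecols=(3, 4, 5, 6),
    encoding="utf-8",
)

# Split by field name instead of relying on unpack=True.
x1, x2, x3, x4 = (data[name] for name in data.dtype.names)
print(x1, x4)
```

Indexing by field name does not depend on how unpack treats structured dtypes, so it is the more portable of the two options.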

The safest way to use genfromtxt is:

data = np.genfromtxt(...)
print(data.shape)
print(data.dtype)

and then make sure you understand that shape and dtype before moving on to using the data array.
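As a rough illustration of that check, the sketch below wraps the load in a small helper. `load_and_report` is a hypothetical name, not part of NumPy, and the sample text is invented. It shows how the same text gives a 2-D array without a dtype and a 1-D structured array with a compound dtype:

```python
import io
import numpy as np

def load_and_report(source, **kwargs):
    """Hypothetical helper: load with genfromtxt, then print shape and dtype."""
    data = np.genfromtxt(source, **kwargs)
    print("shape:", data.shape)
    print("dtype:", data.dtype)
    return data

sample = "1.2 3.3 2.0\n3.3 3.3 2.2\n"

# No dtype given: a plain 2-D float array (rows x columns).
plain = load_and_report(io.StringIO(sample), encoding="utf-8")

# Compound dtype: a 1-D structured array with one record per row.
dt = np.dtype([("f0", "f8"), ("f1", "f8"), ("f2", "f8")])
structured = load_and_report(io.StringIO(sample), dtype=dt, encoding="utf-8")
```

Only the first result has a column axis that unpack=True or `.T` can split. The second has to be split by field name.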

Upvotes: 2
