Reputation: 705
I have a file with this form:
label1, value1, value2, value3,
label2, value1, value2, value3,
...
I want to read it with numpy's loadtxt function so that each label is paired with its values. The final result should be an array of arrays, each holding a label and an array of features, like this:
array([[label1, [value1, value2, value3]],
[label2, [value1, value2, value3]]])
I have tried the following, but it did not work:
c = StringIO(u"text.txt")
np.loadtxt(c,
dtype={'samples': ('label', 'features'), 'formats': ('s9',np.float)},
delimiter=',', skiprows=0)
Any ideas?
Upvotes: 1
Views: 2269
Reputation: 231738
You are on the right track with defining the dtype. You are just missing the field shape.
I'll demonstrate:
A 'text' file - a list of lines (bytes in Py3):
In [95]: txt=b"""label1, 12, 23.2, 232
....: label2, 23, 2324, 324
....: label3, 34, 123, 2141
....: label4, 0, 2, 3
....: """
In [96]: txt=txt.splitlines()
A dtype with 2 fields, one with strings, the other with floats (3 for 'field shape'):
In [98]: dt=np.dtype([('label','U10'),('values', 'float',(3))])
In [99]: data=np.genfromtxt(txt,delimiter=',',dtype=dt)
In [100]: data
Out[100]:
array([('label1', [12.0, 23.2, 232.0]), ('label2', [23.0, 2324.0, 324.0]),
('label3', [34.0, 123.0, 2141.0]), ('label4', [0.0, 2.0, 3.0])],
dtype=[('label', '<U10'), ('values', '<f8', (3,))])
In [101]: data['label']
Out[101]:
array(['label1', 'label2', 'label3', 'label4'],
dtype='<U10')
In [103]: data['values']
Out[103]:
array([[ 1.20000000e+01, 2.32000000e+01, 2.32000000e+02],
[ 2.30000000e+01, 2.32400000e+03, 3.24000000e+02],
[ 3.40000000e+01, 1.23000000e+02, 2.14100000e+03],
[ 0.00000000e+00, 2.00000000e+00, 3.00000000e+00]])
With this definition the numeric values can be accessed as a 2d array. Sub-arrays like this are underappreciated.
The dtype could have been specified with the dictionary syntax, but I'm more familiar with the list of tuples form.
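Roughly equivalent dtype specs (the string forms get the default field names f0 and f1; note that 'f' is float32 and 'S10' is a byte string, so these do not match dt exactly):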
np.dtype("U10, (3,)f")
np.dtype({'names':['label','values'], 'formats':['S10','(3,)f']})
np.genfromtxt(txt,delimiter=',',dtype='S10,(3,)f')
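For comparison, here is a minimal sketch of the dictionary syntax that keeps the same field names, the unicode label and the float64 sub-array as the list-of-tuples form. The dictionary keys have to be 'names' and 'formats'; the variable names dt_list and dt_dict are only illustrative.

```python
import numpy as np

# List-of-tuples form: a 10-character unicode label plus a (3,) float64 sub-array.
dt_list = np.dtype([('label', 'U10'), ('values', 'f8', (3,))])

# Dictionary form of the same layout. Each entry in 'formats' can be anything
# that np.dtype accepts, including a (base, shape) tuple for a sub-array.
dt_dict = np.dtype({'names': ['label', 'values'],
                    'formats': ['U10', ('f8', (3,))]})

print(dt_dict.names)             # field names
print(dt_dict['values'].shape)   # shape of the sub-array field
print(dt_dict.itemsize == dt_list.itemsize)
```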
===============================
I think that this txt, if parsed with dtype=None, would produce
In [30]: y
Out[30]:
array([('label1', 12.0, 23.2, 232.0), ('label2', 23.0, 2324.0, 324.0),
('label3', 34.0, 123.0, 2141.0), ('label4', 0.0, 2.0, 3.0)],
dtype=[('f0', '<U10'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])
This could be converted to the subfield form with
y.view(dt)
This works as long as the underlying data representation (seen as a flat list of bytes) is compatible (here 10 unicode characters (40 bytes), and 3 floats, per record).
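A small sketch of that conversion, assuming the flat array has exactly the dtype shown above. Here y is built by hand rather than parsed from the file, and the names flat_dt, sub_dt and z are only illustrative.

```python
import numpy as np

# Flat layout: one string field followed by three separate float64 fields.
flat_dt = np.dtype([('f0', 'U10'), ('f1', 'f8'), ('f2', 'f8'), ('f3', 'f8')])
# Sub-array layout: the same bytes, with the three floats grouped in one field.
sub_dt = np.dtype([('label', 'U10'), ('values', 'f8', (3,))])

y = np.array([('label1', 12.0, 23.2, 232.0),
              ('label2', 23.0, 2324.0, 324.0)], dtype=flat_dt)

# The view is only meaningful when both record layouts have the same size.
assert flat_dt.itemsize == sub_dt.itemsize

z = y.view(sub_dt)          # reinterprets the same memory, no copy
print(z['label'])
print(z['values'].shape)    # one row of 3 floats per record
```

Because view reinterprets memory rather than copying it, writing to z['values'] also changes y.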
Upvotes: 2
Reputation: 18668
The most modern and versatile way to do that is to use pandas, whose parser has many more options and handles labels.
Suppose your file contains :
A,7,5,1
B,4,2,7
Then :
In [29]: import pandas as pd
In [30]: df=pd.read_csv('data.csv',sep=',',header=None,index_col=0)
In [31]: df
Out[31]:
1 2 3
0
A 7 5 1
B 4 2 7
You can easily convert it into a structured array now:
In [32]: a=df.T.to_records(index=False)
Out[32]:
rec.array([(7, 4), (5, 2), (1, 7)],
dtype=[('A', '<i8'), ('B', '<i8')])
In [33]: a['A']
Out[33]: array([7, 5, 1], dtype=int64)
With loadtxt you would have to do a lot of low-level operations manually.
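If the goal is the layout from the question (labels alongside a 2d block of features), a sketch along these lines may be closer than the transposed record array. It reads the same two-line sample from an in-memory string instead of data.csv, and the names labels and values are only illustrative.

```python
import io

import pandas as pd

# Stand-in for the file contents shown above.
sample = io.StringIO("A,7,5,1\nB,4,2,7\n")

# No header row; the first column becomes the index (the labels).
df = pd.read_csv(sample, sep=',', header=None, index_col=0)

labels = df.index.to_numpy()   # one label per row
values = df.to_numpy()         # 2d array, one row of features per label

for label, row in zip(labels, values):
    print(label, row)
```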
Upvotes: 3