Reputation: 705

read values from a text file using numpy loadtxt function

I have a file with this form:

label1, value1, value2, value3,
label2, value1, value2, value3,
...

I want to read it with NumPy's loadtxt function so that each label is paired with its values. The final result should be an array of arrays, where each inner array holds the label and an array of features, like this:

array([[label1, [value1, value2, value3]],
       [label2, [value1, value2, value3]]])

I have tried the following, but it did not work:

c = StringIO(u"text.txt")
np.loadtxt(c,
   dtype={'samples': ('label', 'features'), 'formats': ('s9',np.float)},
   delimiter=',', skiprows=0)

Any idea?

Upvotes: 1

Answers (2)

hpaulj

Reputation: 231738

You are on the right track with defining the dtype. You are just missing the field shape.

I'll demonstrate:

A 'text' file - a list of lines (bytes in Py3):

In [95]: txt=b"""label1, 12, 23.2, 232
   ....: label2, 23, 2324, 324
   ....: label3, 34, 123, 2141
   ....: label4, 0, 2, 3
   ....: """

In [96]: txt=txt.splitlines()

A dtype with 2 fields, one with strings, the other with floats (3 for 'field shape'):

In [98]: dt=np.dtype([('label','U10'),('values', 'float',(3))])

In [99]: data=np.genfromtxt(txt,delimiter=',',dtype=dt)

In [100]: data
Out[100]: 
array([('label1', [12.0, 23.2, 232.0]), ('label2', [23.0, 2324.0, 324.0]),
       ('label3', [34.0, 123.0, 2141.0]), ('label4', [0.0, 2.0, 3.0])], 
      dtype=[('label', '<U10'), ('values', '<f8', (3,))])

In [101]: data['label']
Out[101]: 
array(['label1', 'label2', 'label3', 'label4'], 
      dtype='<U10')

In [103]: data['values']
Out[103]: 
array([[  1.20000000e+01,   2.32000000e+01,   2.32000000e+02],
       [  2.30000000e+01,   2.32400000e+03,   3.24000000e+02],
       [  3.40000000e+01,   1.23000000e+02,   2.14100000e+03],
       [  0.00000000e+00,   2.00000000e+00,   3.00000000e+00]])

With this definition the numeric values can be accessed as a 2d array. Sub-arrays like this are underappreciated.
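The session above can be condensed into a standalone script. This is only a minimal sketch: the sample lines are made up to mirror the question's layout (including the trailing comma, which is stripped before parsing so that each row has exactly four columns), and the names `raw_lines`, `lines`, `labels` and `values` are illustrative.

```python
import numpy as np

# Made-up sample lines in the question's layout, trailing comma included.
raw_lines = [
    "label1, 12, 23.2, 232,",
    "label2, 23, 2324, 324,",
    "label3, 34, 123, 2141,",
    "label4, 0, 2, 3,",
]

# Drop the trailing comma so each line splits into exactly four columns.
lines = [line.rstrip().rstrip(",") for line in raw_lines]

# One string field plus one float field with sub-array shape (3,).
dt = np.dtype([("label", "U10"), ("values", "f8", (3,))])

# genfromtxt accepts a list of lines as well as a file name or file object.
data = np.genfromtxt(lines, delimiter=",", dtype=dt)

labels = data["label"]    # 1-D array of strings, one per row
values = data["values"]   # 2-D float array, one row of 3 values per label
```

For a real file, the list comprehension could iterate over the open file object instead of `raw_lines`.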

The dtype could have been specified with the dictionary syntax, but I'm more familiar with the list of tuples form.

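Roughly equivalent dtype specs (the comma-string forms name the fields f0 and f1; 'f' is float32 and 'S10' is a bytes field, so use 'f8' and 'U10' to match dt exactly):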

np.dtype("U10, (3,)f")
np.dtype({'names':['label','values'], 'formats':['S10','(3,)f']})
np.genfromtxt(txt,delimiter=',',dtype='S10,(3,)f')

===============================

I think that this txt, if parsed with dtype=None would produce

In [30]: y
Out[30]: 
array([('label1', 12.0, 23.2, 232.0), ('label2', 23.0, 2324.0, 324.0),
       ('label3', 34.0, 123.0, 2141.0), ('label4', 0.0, 2.0, 3.0)], 
      dtype=[('f0', '<U10'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])

This could be converted to the sub-array form with

y.view(dt)

This works as long as the underlying data representation (seen as a flat list of bytes) is compatible (here 10 unicode characters (40 bytes), and 3 floats, per record).
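A small sketch of that conversion follows. The flat array `y` is built by hand here rather than parsed from a file, and the names `flat_dt` and `nested` are illustrative. Comparing the two item sizes is a quick way to check that the layouts are compatible before taking the view.

```python
import numpy as np

# Flat layout: one string field followed by three separate float fields.
flat_dt = np.dtype([("f0", "U10"), ("f1", "f8"), ("f2", "f8"), ("f3", "f8")])
y = np.array(
    [("label1", 12.0, 23.2, 232.0), ("label2", 23.0, 2324.0, 324.0)],
    dtype=flat_dt,
)

# Target layout: the same bytes, with the three floats grouped as a sub-array.
dt = np.dtype([("label", "U10"), ("values", "f8", (3,))])

# Both records occupy the same number of bytes (40 for U10 plus 3 * 8).
same_size = flat_dt.itemsize == dt.itemsize

# view() reinterprets the existing buffer; no data is copied.
nested = y.view(dt)
```

Because `nested` shares memory with `y`, writing to `nested["values"]` also changes `y`.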

Upvotes: 2

B. M.

Reputation: 18668

The most modern and versatile way to do that is to use pandas, whose parser has many more options and handles labels.

Suppose your file contains:

A,7,5,1
B,4,2,7

Then:

In [29]: import pandas as pd
In [30]: df=pd.read_csv('data.csv',sep=',',header=None,index_col=0)

In [31]: df
Out[31]: 
   1  2  3
0         
A  7  5  1
B  4  2  7

You can now easily convert it to a structured array:

In [32]: a=df.T.to_records(index=False)
Out[32]: 
rec.array([(7, 4), (5, 2), (1, 7)], 
          dtype=[('A', '<i8'), ('B', '<i8')])

In [33]: a['A']
Out[33]: array([7, 5, 1], dtype=int64)
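As a self-contained variant of the session above, the sketch below reads the same two rows from an in-memory string instead of `data.csv` and separates the labels from a plain 2-D float array. The names `csv_text`, `labels` and `values` are illustrative.

```python
import io

import pandas as pd

# The same two rows as above, held in memory instead of a file on disk.
csv_text = "A,7,5,1\nB,4,2,7\n"

# No header row; the first column becomes the index (the labels).
df = pd.read_csv(io.StringIO(csv_text), header=None, index_col=0)

labels = df.index.to_numpy()        # array of label strings
values = df.to_numpy(dtype=float)   # 2-D array, one row of values per label
```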

With loadtxt you will have to do a lot of low-level operations manually.

Upvotes: 3
