user10813328

How to load csv file containing strings and numbers using genfromtxt?

I'm trying to load a csv file into a NumPy array for machine learning purposes. Until now I have always worked with int or float data, but my current csv contains strings, floats and ints, so I'm having trouble with the dtype argument. My dataset has 41188 samples and 8 features, e.g.:

47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"

I know that if I specify dtype=None, the types will be determined by the contents of each column:

data = np.genfromtxt(filename, dtype=None, delimiter=";", skip_header=1)

but it apparently doesn't work. First of all, the result of genfromtxt is a NumPy ndarray with the following shape:

In [2]: data.shape
Out[2]: (41188,)

while I expected (41188,8).

If I instead use the default dtype:

data2 = np.genfromtxt(filename, delimiter=";", skip_header=1)

I obtain the following shape of data:

In [4]: data2.shape
Out[4]: (41188,8)

Secondly, with dtype=None I obtain the following deprecation warning:

VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.

I can silence it by passing the following (is this correct?):

encoding='ASCII'

I have 2 questions:

  1. How can I set the correct type for each column?
  2. Why do I have to set the encoding?

Upvotes: 4

Views: 6877

Answers (1)

hpaulj

Reputation: 231325

With 2 copies of your sample line:

In [140]: data = np.genfromtxt(txt, dtype=None, delimiter=';', encoding=None)
In [141]: data
Out[141]: 
array([(47, '"university.degree"', '"yes"', 176, 1, 93.994, -36.4, 4.857, '"no"'),
       (47, '"university.degree"', '"yes"', 176, 1, 93.994, -36.4, 4.857, '"no"')],
      dtype=[('f0', '<i8'), ('f1', '<U19'), ('f2', '<U5'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<U4')])
In [142]: data.shape
Out[142]: (2,)
In [143]: data.dtype
Out[143]: dtype([('f0', '<i8'), ('f1', '<U19'), ('f2', '<U5'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<U4')])

This is a normal structured array: data is a 1d array with 9 fields (f0 through f8), one per column of the sample line, which is why the shape is (2,) rather than 2d. Each field has the integer, float or string dtype that is common to its column.
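If you would rather set the type of each column yourself than rely on dtype=None, genfromtxt also accepts a structured dtype. A minimal sketch, assuming the same two-line sample; the field names and string widths below are made up for illustration, since the real header row isn't shown in the question:

```python
import io
import numpy as np

# Two copies of the sample line from the question stand in for the real file.
sample = '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n' * 2

# Hypothetical field names and string widths (assumptions, not the real header).
column_types = [
    ("age", "i8"),
    ("education", "U32"),
    ("flag", "U8"),
    ("duration", "i8"),
    ("count", "i8"),
    ("index_a", "f8"),
    ("index_b", "f8"),
    ("rate", "f8"),
    ("label", "U8"),
]

# One (name, type) pair per column; the result is still a 1d structured array.
typed = np.genfromtxt(
    io.StringIO(sample),
    dtype=column_types,
    delimiter=";",
    encoding="utf-8",
)

print(typed.shape)
print(typed.dtype.names)
```

The shape is still (2,), but the fields now carry the names and types given in column_types instead of the inferred f0, f1, ... defaults.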

You access fields by name rather than column number:

In [144]: data['f0']
Out[144]: array([47, 47])
In [145]: data['f1']
Out[145]: array(['"university.degree"', '"university.degree"'], dtype='<U19')
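If you need an ordinary 2d numeric array for a machine learning library, one option is to stack the numeric fields yourself. The sketch below assumes the same two-line sample and the default f0...f8 field names; it also strips the literal double quotes that genfromtxt leaves in the string fields:

```python
import io
import numpy as np

# Two copies of the sample line from the question stand in for the real file.
sample = '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n' * 2
data = np.genfromtxt(io.StringIO(sample), dtype=None, delimiter=";", encoding="utf-8")

# Fields whose dtype kind is integer ("i") or float ("f") are the numeric columns.
numeric_names = [n for n in data.dtype.names if data.dtype[n].kind in "if"]

# Stack the numeric fields side by side to get a plain 2d float array.
numeric = np.column_stack([data[n].astype(float) for n in numeric_names])

# np.char.strip removes the surrounding double quotes from a string field.
education = np.char.strip(data["f1"], '"')

print(numeric.shape)
print(education)
```

The string columns still have to be encoded separately (for example as category codes) before they can join the numeric array.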

Note that I included encoding=None. I'm not entirely sure when that's necessary, but it's easy to include and it silences the warning.
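As far as I can tell, the warning comes from NumPy releases in which genfromtxt defaulted to encoding='bytes' and therefore returned byte strings for text columns. Passing any explicit encoding avoids that. A small sketch, assuming the sample is plain ASCII so that 'utf-8', 'ascii' and None would all decode it the same way:

```python
import io
import numpy as np

# Two copies of the sample line from the question stand in for the real file.
sample = '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n' * 2

# With an explicit encoding, text columns come back as unicode strings.
decoded = np.genfromtxt(io.StringIO(sample), dtype=None, delimiter=";", encoding="utf-8")

# dtype kind "U" means unicode str, as opposed to "S" for byte strings.
string_kind = decoded.dtype["f1"].kind
print(string_kind)
```

So encoding='ASCII' from the question should be fine as long as the file really contains only ASCII characters.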

Upvotes: 6