I'm trying to load a CSV file into a NumPy array for machine learning. Until now I have always worked with int or float data, but my current CSV contains strings, floats and ints, so I'm having trouble with the dtype argument. My dataset has 41188 samples and 8 features, e.g.:
47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"
I know that if I specify dtype=None, the type of each column is determined from its contents:
data = np.genfromtxt(filename, dtype=None, delimiter=";", skip_header=1)
but it apparently doesn't work. First of all, the result of genfromtxt is a numpy ndarray with the following shape:
In [2]: data.shape
Out[2]: (41188,)
while I expected (41188,8).
If I instead use the default dtype:
data2 = np.genfromtxt(filename, delimiter=";", skip_header=1)
I obtain the following shape of data:
In [4]: data2.shape
Out[4]: (41188,8)
Secondly, with dtype=None I obtain the following deprecation warning:
VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
I can fix that (is this correct?) by using:
encoding='ASCII'
I have 2 questions:
Upvotes: 4
Views: 6877
Reputation: 231325
With 2 copies of your sample line:
In [140]: data = np.genfromtxt(txt, dtype=None, delimiter=';', encoding=None)
In [141]: data
Out[141]:
array([(47, '"university.degree"', '"yes"', 176, 1, 93.994, -36.4, 4.857, '"no"'),
(47, '"university.degree"', '"yes"', 176, 1, 93.994, -36.4, 4.857, '"no"')],
dtype=[('f0', '<i8'), ('f1', '<U19'), ('f2', '<U5'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<U4')])
In [142]: data.shape
Out[142]: (2,)
In [143]: data.dtype
Out[143]: dtype([('f0', '<i8'), ('f1', '<U19'), ('f2', '<U5'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<f8'), ('f8', '<U4')])
This is a normal structured array: data is a 1d array with 9 fields (f0 through f8), one per column in the sample line. Each field has a dtype that matches the float, integer or string type common to its column.
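To reproduce this without the original file, here is a minimal sketch. It assumes txt can be stood in for by an io.StringIO holding two copies of the sample row; that stand-in is my assumption, not something from the question.

```python
import io
import numpy as np

# Two copies of the sample row from the question. The StringIO object is an
# assumed stand-in for the real CSV file.
row = '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n'
txt = io.StringIO(row * 2)

# dtype=None asks genfromtxt to infer a type per column, which yields a
# 1d structured array: one record per row, one field per column.
data = np.genfromtxt(txt, dtype=None, delimiter=";", encoding=None)

print(data.shape)             # (2,)
print(data.dtype.names)       # auto-generated field names, f0 through f8
print(len(data.dtype.names))  # number of columns found in the row
```

The shape counts rows only; the columns live in the dtype, which is why data.shape has a single dimension.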
You access fields by name rather than column number:
In [144]: data['f0']
Out[144]: array([47, 47])
In [145]: data['f1']
Out[145]: array(['"university.degree"', '"university.degree"'], dtype='<U19')
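If you need a plain 2d float array for a learning library, one possible approach is to stack the numeric fields yourself and handle the string fields separately. This is only a sketch under assumptions: the variable names X, numeric_names and education are made up for illustration, and the row is again the sample from the question.

```python
import io
import numpy as np

row = '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n'
data = np.genfromtxt(io.StringIO(row * 2), dtype=None, delimiter=";", encoding=None)

# genfromtxt keeps the double quotes as part of each string, so strip them.
education = np.char.strip(data["f1"], '"')

# Pick the fields whose dtype kind is integer ('i') or float ('f').
numeric_names = [n for n in data.dtype.names if data.dtype[n].kind in "if"]

# Stack those fields as columns; mixing ints and floats upcasts to float.
X = np.column_stack([data[n] for n in numeric_names]).astype(float)

print(numeric_names)
print(X.shape)
print(education)
```

The string columns ("yes"/"no" and the education level) would still need some encoding into numbers before they can join X; how to do that depends on the model and is outside this sketch.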
Note that I included encoding=None. I'm not entirely sure when that's necessary, but it's easy to include.
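For comparison, here is a sketch that passes an explicit encoding while reading from a file on disk. The temporary file and the choice of "utf-8" are assumptions for illustration; an ASCII-only file reads the same way under UTF-8, since ASCII is a subset of it.

```python
import os
import tempfile
import numpy as np

row = '47;"university.degree";"yes";176;1;93.994;-36.4;4.857;"no"\n'

# Write two sample rows to a temporary file (hypothetical stand-in for the CSV).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8") as fh:
    fh.write(row * 2)
    path = fh.name

try:
    # An explicit encoding makes genfromtxt decode the text columns to str.
    data = np.genfromtxt(path, dtype=None, delimiter=";", encoding="utf-8")
finally:
    os.remove(path)

print(data["f1"].dtype.kind)  # 'U' means a unicode string field
```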
Upvotes: 6