Reputation: 493
New to Python, NumPy, and Matplotlib.
I am trying to create a 2D NumPy array from a CSV of various data types, but I will treat them all as strings. The killer is that I need to be able to access them with tuple indices, like [:,5] to get the 5th column, or [5] to get the 5th row.
Is there any way to do this?
It seems that this is a limitation of NumPy due to the memory-access calculations:
dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print dataSet[:, 4]  # <--- results in IndexError: Invalid Index
I have also tried loadfromgen, dtype = str and dtype = "a16", as well as dtype = object. Nothing works. I can either load the data and it does not have column access, or I can't load the data at all.
Upvotes: 0
Views: 791
Reputation: 231385
Simulate your file using the sample line from the comment, replicated several times (i.e. one string per row of the file):
In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
In [9]: txt = [txt for _ in range(5)]
In [10]: txt
Out[10]:
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']
Load with genfromtxt, specifying the delimiter, and let it choose the best dtype per column (dtype=None):
In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None)
In [13]: A
Out[13]:
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),
(39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...],
dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])
The result is a 5-element 1d array with a compound (structured) dtype:
In [14]: A.shape
Out[14]: (5,)
In [15]: A.dtype
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'),
('f3', 'S10'), ('f4', '<i4'), ....])
Access a 'column' with a field name (not a column number):
In [16]: A['f4']
Out[16]: array([13, 13, 13, 13, 13])
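As a minimal sketch of working with the structured array, the snippet below repeats the same load and then picks out a field, a row, and a single value. It assumes a recent NumPy where genfromtxt accepts a list of str lines and the encoding argument; the variable names (lines, records, and so on) are only illustrative, and autostrip=True is added here to drop the leading spaces.

```python
import numpy as np

# One sample row, repeated to stand in for the real file
# (assumption: plain str lines instead of the bytes used in the transcript).
line = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
        "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [line] * 5

# dtype=None infers a type per column; encoding is passed explicitly so that
# text fields come back as str, and autostrip removes surrounding whitespace.
records = np.genfromtxt(lines, delimiter=",", dtype=None,
                        encoding="utf-8", autostrip=True)

field_names = records.dtype.names   # automatically generated: 'f0' ... 'f14'
education_num = records["f4"]       # one "column", selected by field name
first_row = records[0]              # one record (row), selected by position
workclass = records[0]["f1"]        # a single value: row 0, field 'f1'

print(records.shape, field_names[:3], education_num, workclass)
```

Note that records is still 1d, so records[:, 4] would fail; the field name takes the place of the column index.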
Or load as dtype=str:
In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str)
In [18]: A
Out[18]:
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13',
' Never-married', ' Adm-clerical', ' Not-in-family', ' White',
' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'],
...
' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']],
dtype='<U14')
In [19]: A.dtype
Out[19]: dtype('<U14')
In [20]: A.shape
Out[20]: (5, 15)
In [21]: A[:,4]
Out[21]:
array([' 13', ' 13', ' 13', ' 13', ' 13'],
dtype='<U14')
Now it is a 15-column 2d array that can be indexed by column number.
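For the all-strings case the question asks about, a sketch along these lines may be closer to what is wanted. It again assumes a recent NumPy and a list of str lines; autostrip=True is an addition (not used in the transcript above) that removes the leading space from each field.

```python
import numpy as np

# Same sample row as above, repeated five times (illustrative data only).
line = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
        "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [line] * 5

# dtype=str keeps every field as a string, giving a regular 2d array.
A = np.genfromtxt(lines, delimiter=",", dtype=str, autostrip=True)

column_4 = A[:, 4]              # column by number
row_2 = A[2]                    # row by number
as_int = A[:, 4].astype(int)    # convert one column to numbers when needed

print(A.shape, column_4, as_int.sum())
```

Keeping everything as strings makes plain integer indexing work, at the cost of converting numeric columns by hand with astype.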
With the wrong delimiter (' ,' as in the question), it loads one column per row:
In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str)
In [25]: A
Out[25]:
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
...],
dtype='<U127')
In [26]: A.shape
Out[26]: (5,)
The result is a 1d array with a long string dtype, so a two-index expression such as [:, 4] raises an IndexError.
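A quick way to catch this kind of mistake is to check ndim and shape before indexing. The sketch below reproduces the failure and then the fix; it assumes a recent NumPy and a list of str lines, and the names bad, good, and error are illustrative.

```python
import numpy as np

# Same sample row as above, repeated five times (illustrative data only).
line = (" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
        "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [line] * 5

# ' ,' (space then comma) never occurs in the data, so nothing is split.
bad = np.genfromtxt(lines, delimiter=" ,", dtype=str)
print(bad.ndim, bad.shape)      # inspect before indexing

# Two indices on a 1d array fail, which matches the error in the question.
try:
    bad[:, 4]
    error = None
except IndexError as exc:
    error = exc
print(type(error).__name__)

# With the correct delimiter the array is 2d and column indexing works.
good = np.genfromtxt(lines, delimiter=",", dtype=str)
print(good.ndim, good.shape)
```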
A CSV file might be loaded in various ways, some intentional, some not. You have to look at the results (shape and dtype) and try to understand them before blindly trying to index columns.
Upvotes: 1