playercharlie

Reputation: 629

numpy.loadtxt imports data as an array of arrays rather than a multidimensional array

I have a csv file, which has the first three columns like this

2011,12,25,...
2011,12,26,...
2011,12,27,...
...

These columns are the year, month and day. The other columns contain strings. There are 100 rows and 6 columns in total. I use numpy.loadtxt to read this into an array:

input = numpy.loadtxt('file.csv', dtype='i4, i4, i4, S4, S4, S4', delimiter=',')

Problem: As I understand it, this loadtxt call should return an array of shape 100x6. However, it returns an array of 100x1, with each element being an array of 1x6.

I want this to be a normal 2D array of 100x6. I looked up some resources on the net. It seems that since some of the columns in the CSV data contain strings, I have to use the dtype argument, and that results in the output being a 1D array of arrays rather than a 2D array. I have tried some of the examples given on these sites, and they seem to work fine as long as all the entries in the CSV file are numbers.

What I am looking for is either


Sample CSV file:

2011,12,25,AAA,AAA,AAA
2011,12,26,BBB,BBB,BBB
2011,12,27,CCC,CCC,CCC

Upvotes: 2

Views: 3763

Answers (1)

Veedrac

Reputation: 60117

You are right that np.loadtxt returns a 1D array, but you can still access the 'columns', which are actually fields in a structured array:

array([(2011, 12, 25, b'AAA', b'AAA', b'AAA'),
       (2011, 12, 26, b'BBB', b'BBB', b'BBB'),
       (2011, 12, 27, b'CCC', b'CCC', b'CCC')], 
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S4'), ('f4', 'S4'), ('f5', 'S4')])

It does let you index the fields, but you need to do so by name (f0, f1, f2, ...) rather than by position. With nt being the array returned by loadtxt:

nt['f3']
#>>> array([b'AAA', b'BBB', b'CCC'], 
#>>>       dtype='|S4')

You can of course specify the dtype names:

dtype=[('MEAT', '<i4'), ('CHEESE', '<i4'), ('TOAST', '<i4'), ('BIRD', 'S4'), ('PLANE', 'S4'), ('SOCK', 'S4')]
nt = numpy.loadtxt('/home/joshua/file.csv', dtype=dtype, delimiter=',')

nt['SOCK']
#>>> array([b'AAA', b'BBB', b'CCC'], 
#>>>       dtype='|S4')
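
For anyone who wants to try this without a file on disk, here is a minimal sketch that feeds the three sample rows from the question through io.StringIO. The field names (year, month, day, a, b, c) are made up for illustration and are not from the question or the answer above.

```python
import io
import numpy as np

# The three sample rows from the question, held in memory instead of file.csv
csv_text = (
    "2011,12,25,AAA,AAA,AAA\n"
    "2011,12,26,BBB,BBB,BBB\n"
    "2011,12,27,CCC,CCC,CCC\n"
)

# Hypothetical field names chosen for this sketch
dtype = [('year', 'i4'), ('month', 'i4'), ('day', 'i4'),
         ('a', 'S4'), ('b', 'S4'), ('c', 'S4')]

nt = np.loadtxt(io.StringIO(csv_text), dtype=dtype, delimiter=',')

print(nt.shape)        # one entry per row: a 1D structured array
print(nt.dtype.names)  # the field ("column") names
print(nt['year'])      # a whole column, selected by field name
print(nt[0])           # a whole row, selected by position
print(nt[0]['day'])    # a single value: row first, then field
```

Rows are indexed by position and columns by name, so nt[0]['day'] and nt['day'][0] refer to the same value.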

NumPy returns a structured array here because an ordinary 2D array must have a single dtype for every element. This avoids a lot of complications that arise from non-homogeneous arrays.
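
If a plain 2D array is really what is wanted, one possible workaround (a sketch, not something the answer above proposes) is to read every column as text so that all elements share one dtype, then convert the numeric columns afterwards. This assumes a reasonably recent NumPy where dtype=str yields ordinary Python 3 strings, and it again uses the question's sample rows in memory.

```python
import io
import numpy as np

# The three sample rows from the question, held in memory instead of file.csv
csv_text = (
    "2011,12,25,AAA,AAA,AAA\n"
    "2011,12,26,BBB,BBB,BBB\n"
    "2011,12,27,CCC,CCC,CCC\n"
)

# Every column is read as a string, so the result is a regular 2D array
table = np.loadtxt(io.StringIO(csv_text), dtype=str, delimiter=',')
print(table.shape)  # rows x columns

# Split the homogeneous parts and convert the date columns to integers
dates = table[:, :3].astype(int)   # year, month, day
labels = table[:, 3:]              # the remaining string columns

print(dates)
print(labels)
```

The trade-off is that the numbers are stored as strings until they are converted, so this suits cases where positional column indexing matters more than keeping each column in its native type.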

Upvotes: 3
