Mary Ziemba

Reputation: 25

Numpy not accepting strings correctly?

I have some data in a CSV file formatted like this (I deleted some columns for simplicity):

Year,Region,Round,Diff
2014,South,Second Round,-24
2015,West,First Round,48
# ...lots of rows of this

I want to use both the string data in the Region and Round columns and the integer data in the Diff column.

Here is my relevant code:

import sklearn
import numpy as np
from numpy import genfromtxt
from StringIO import StringIO

# Some other code...

my_dtype=[('Year', int), ('Region', str),('Round', str),('Diff', int)] 
data = np.genfromtxt(my_file, delimiter=',',names=True,dtype=my_dtype)
print data

When I print the data, I get the following. NumPy turns every string field into an empty string.

[ ( 2014, '', '', -24)
( 2010, '', '', 48)
...]

Does anyone know how I could fix this? Am I using the dtype attribute wrong? Or something else? Thanks in advance.

Upvotes: 0

Views: 49

Answers (1)

Warren Weckesser

Reputation: 114811

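A bare str in a structured dtype describes a string field of length zero, so every value gets truncated to an empty string. Instead of putting str for the data type of the text fields, use the S format with a maximum string length: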

In [10]: my_dtype = [('Year', int), ('Region', 'S8'), ('Round', 'S16'), ('Diff', int)] 

In [11]: data = np.genfromtxt('regions.csv', delimiter=',', names=True, dtype=my_dtype)

In [12]: data
Out[12]: 
array([(2014, b'South', b'Second Round', -24),
       (2015, b'West', b'First Round',  48)], 
      dtype=[('Year', '<i8'), ('Region', 'S8'), ('Round', 'S16'), ('Diff', '<i8')])
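
If you are on Python 3 and would rather get str values than the b'...' bytes shown above, a unicode ('U') format should work the same way. This is a minimal sketch, assuming NumPy 1.14 or later; the inline csv_text and io.StringIO are stand-ins for regions.csv so the snippet runs on its own:

```python
import io

import numpy as np

# Stand-in for regions.csv (assumption: same layout as the sample in the question).
csv_text = """Year,Region,Round,Diff
2014,South,Second Round,-24
2015,West,First Round,48
"""

# 'U8' / 'U16' are unicode strings of at most 8 / 16 characters.
# Longer values would be truncated, so pick widths that fit the data.
my_dtype = [('Year', int), ('Region', 'U8'), ('Round', 'U16'), ('Diff', int)]

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                     names=True, dtype=my_dtype)

# Fields are addressed by column name.
print(data['Region'])
print(data['Diff'])
```

The widths 8 and 16 are just carried over from the S8/S16 choice above.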

You can also use dtype=None and let genfromtxt() determine the data type for you:

In [13]: data = np.genfromtxt('regions.csv', delimiter=',', names=True, dtype=None)

In [14]: data
Out[14]: 
array([(2014, b'South', b'Second Round', -24),
       (2015, b'West', b'First Round',  48)], 
      dtype=[('Year', '<i8'), ('Region', 'S5'), ('Round', 'S12'), ('Diff', '<i8')])
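
With dtype=None you can also pass an encoding so that the inferred text columns come back as str instead of bytes. Again only a sketch, assuming NumPy 1.14 or later (where genfromtxt takes an encoding argument); sample_csv is a hypothetical in-memory stand-in for regions.csv:

```python
import io

import numpy as np

# Stand-in for regions.csv (assumption: same layout as the sample in the question).
sample_csv = """Year,Region,Round,Diff
2014,South,Second Round,-24
2015,West,First Round,48
"""

# dtype=None lets genfromtxt infer each column's type from its contents;
# encoding='utf-8' makes the text columns unicode rather than bytes.
inferred = np.genfromtxt(io.StringIO(sample_csv), delimiter=',',
                         names=True, dtype=None, encoding='utf-8')

print(inferred.dtype)
print(inferred['Round'])
```

The trade-off is that inferred string widths depend on the longest value in the file, so the dtype can differ from one file to the next.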

Upvotes: 1
