Reputation: 145
My question is how to create a NumPy (np) array from a CSV file that has columns of type int and string. np.genfromtxt is documented as the function of choice for this [1,2]. I am using Python 3.5.1 and NumPy 1.11.0; however, the latest NumPy documentation I found is for 1.10.0 [3]. This may be relevant, since I get a NumPy error further down.
Let me start with what I have:
import numpy as np
from io import BytesIO
# Define the input
input = "1,3,Hello\n2,4,World"
# Create a structured np.array from input by reading from BytesIO.
output = np.genfromtxt(BytesIO(input.encode()),
                       delimiter=',',
                       dtype=None)
# output.dtype.names -> ('f0', 'f1', 'f2')
Here, the columns f0 and f1 are of type int, and f2 is a byte array. Thus
output['f2'] == 'Hello' # -> False
is false, as the type differs. A proper comparison must be written as
output['f2'] == b'Hello' # -> [True, False]
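To illustrate the difference independently of genfromtxt, here is a minimal sketch. The array f2_bytes is a hypothetical stand-in for the f2 column, built by hand rather than read from the file. It shows that a byte-string array (dtype kind 'S') matches bytes literals, and that np.char.decode is one possible way to get a Unicode array that can be compared against a plain str.

```python
import numpy as np

# Hypothetical stand-in for the f2 column: a fixed-width byte-string array.
f2_bytes = np.array([b'Hello', b'World'])
print(f2_bytes.dtype.kind)  # kind 'S' means byte strings

# Comparing against a bytes literal works element-wise.
bytes_match = f2_bytes == b'Hello'
print(bytes_match)

# Decoding yields a Unicode array (kind 'U') that compares against str.
f2_text = np.char.decode(f2_bytes, 'utf-8')
text_match = f2_text == 'Hello'
print(f2_text.dtype.kind, text_match)
```

This only converts a column after reading; it does not change how genfromtxt itself infers the column type.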
I would prefer to compare against a string rather than a byte array, so I want f2 to be of type str. The solution should be to state the type of each column explicitly. According to [1], this should be possible by setting the genfromtxt argument
dtype=(int, int, str)
such that the minimal example now reads
import numpy as np
from io import BytesIO
# Define the input
input = "1,3,Hello\n2,4,World"
# Create a structured np.array from input by reading from BytesIO.
output = np.genfromtxt(BytesIO(input.encode()),
                       delimiter=',',
                       dtype=(int, int, str))
However, this results in a TypeError: data type not understood. Maybe something has changed between NumPy versions 1.10.0 and 1.11.0. In any case, I cannot get this to work.
Therefore, I tried a second approach using the converters argument of genfromtxt. With this argument, values can be transformed by a function. The example now reads
import numpy as np
from io import BytesIO
# Define the input
input = "1,3,Hello\n2,4,World"
# Create a structured np.array from input by reading from BytesIO.
output = np.genfromtxt(BytesIO(input.encode()),
                       delimiter=',',
                       dtype=None,
                       converters={2: lambda x: str(x, encoding='utf-8')})
By doing so, f2 is indeed of type <U, which I interpret as a little-endian Unicode string, but each row of f2 contains only an empty string ''.
So, how can I read in the given data such that f0 and f1 are int and f2 is str?
[1] http://docs.scipy.org/doc/numpy-1.10.1/user/basics.io.genfromtxt.html
[2] http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html
[3] http://docs.scipy.org/doc/numpy/
Upvotes: 2
Views: 1635
Reputation: 1708
The dtype code for Unicode strings is U. Because NumPy arrays store fixed-size elements, the length must be given as well. In this case, U5 is sufficient:
>>> np.genfromtxt(BytesIO(input.encode()),
...               delimiter=',',
...               dtype=(int, int, 'U5'))
array([(1, 3, 'Hello'), (2, 4, 'World')],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<U5')])
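As a follow-up sketch, the snippet below reads the same sample data with the U5 dtype and checks that f2 can then be compared against an ordinary str, which is what the question asked for. The variable names data, arr and mask are chosen here for illustration; data replaces the question's input only to avoid shadowing the built-in.

```python
import numpy as np
from io import BytesIO

# Same sample data as in the question (renamed to avoid shadowing input()).
data = "1,3,Hello\n2,4,World"

# Two integer columns and one Unicode column of up to 5 characters.
arr = np.genfromtxt(BytesIO(data.encode()),
                    delimiter=',',
                    dtype=(int, int, 'U5'))

# f2 now holds Unicode strings, so it compares directly against str.
mask = arr['f2'] == 'Hello'
print(mask)

# Use the mask to pick the integer columns of the matching rows.
print(arr[mask]['f0'], arr[mask]['f1'])
```

Note that the width is fixed: values longer than five characters would be truncated to fit, so pick a width that covers the longest string you expect.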
Upvotes: 3