ReedWood

Reputation: 145

Create a numpy array of mixed column types from data using genfromtxt

My question is how to create a NumPy (np) array from a CSV file that has columns of type int and string. I found np.genfromtxt documented as the function of choice for this [1,2]. I am using Python 3.5.1 and NumPy 1.11.0. However, the latest NumPy documentation I found is for 1.10.0 [3]. Since I get a NumPy error further down, this might be of interest.

Let me start with what I have:

import numpy as np
from io import BytesIO

# Define the input
input = "1,3,Hello\n2,4,World"

# Create a structured np.array from input by reading from BytesIO.
output = np.genfromtxt(BytesIO(input.encode()),
                       delimiter=',',
                       dtype=None)

# output.dtype.names -> ('f0', 'f1', 'f2')

Here, the columns f0 and f1 are of type int, and f2 is a byte array. Thus

output['f2'] == 'Hello'  # -> False

is false, as the type differs. A proper comparison must be written as

output['f2'] == b'Hello' # -> [True, False]
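
If a bytes column is acceptable as an intermediate result, it can also be decoded after reading. This is only a minimal sketch, assuming the field holds UTF-8 encoded bytes; the array below is built by hand as a stand-in for output['f2'], and the names raw, decoded and matches are illustrative only:

```python
import numpy as np

# Stand-in for output['f2']: a fixed-width bytes column (dtype 'S5').
raw = np.array([b'Hello', b'World'])

# np.char.decode converts every bytes element to str (Unicode dtype).
decoded = np.char.decode(raw, 'utf-8')

# Comparing against a plain str now works element-wise.
matches = decoded == 'Hello'
```

This leaves the reading step untouched and only changes the type of the one column that is compared later.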

I would prefer to compare to a string and not a byte array. Thus, I want f2 to be of type str. The solution should be to state the type of each column explicitly. According to [1], this should be possible by setting the genfromtxt argument

dtype=(int, int, str)

such that the minimal example now reads

import numpy as np
from io import BytesIO

# Define the input
input = "1,3,Hello\n2,4,World"

# Create a structured np.array from input by reading from BytesIO.
output = np.genfromtxt(BytesIO(input.encode()),
                       delimiter=',',
                       dtype=(int, int, str))

However, this results in a TypeError: data type not understood. Maybe something changed between NumPy 1.10.0 and 1.11.0. In any case, I cannot get this to work.

Therefore, I tried a second approach using the converters argument of genfromtxt. With this argument, values can be transformed by a function. The example now reads

import numpy as np
from io import BytesIO

# Define the input
input = "1,3,Hello\n2,4,World"

# Create a structured np.array from input by reading from BytesIO.
output = np.genfromtxt(BytesIO(input.encode()),
                       delimiter=',',
                       dtype=None,
                       converters={2: lambda x: str(x, encoding='utf-8')})

By doing so, f2 is indeed of type <U, which I interpret as a little-endian Unicode string, but each row of f2 contains only an empty string ''.

So, how can I read in the given data such that f0 and f1 are int and f2 is str?

[1] http://docs.scipy.org/doc/numpy-1.10.1/user/basics.io.genfromtxt.html

[2] http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html

[3] http://docs.scipy.org/doc/numpy/

Upvotes: 2

Views: 1635

Answers (1)

Heiko Oberdiek

Reputation: 1708

The dtype code for Unicode strings is U. Because NumPy arrays store fixed-size elements, the string length must be given as well. In this case U5 is sufficient:

>>> np.genfromtxt(BytesIO(input.encode()),
...               delimiter=',',
...               dtype=(int, int, 'U5'))
array([(1, 3, 'Hello'), (2, 4, 'World')],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<U5')])
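
For completeness, here is a self-contained sketch of the same idea. It assumes the explicit field-tuple spelling of the dtype (the names f0, f1 and f2 mirror the defaults shown above), and the variable names data, typed, is_hello and decoded are illustrative. The second call relies on the encoding argument, which to my knowledge exists only in NumPy 1.14 and later, so it would not apply to the 1.11.0 installation in the question:

```python
import numpy as np
from io import BytesIO

data = "1,3,Hello\n2,4,World"

# Explicit structured dtype: two int columns and a 5-character Unicode column.
typed = np.genfromtxt(BytesIO(data.encode()),
                      delimiter=',',
                      dtype=[('f0', int), ('f1', int), ('f2', 'U5')])

# The string column can now be compared against a plain str.
is_hello = typed['f2'] == 'Hello'

# Assumption: NumPy >= 1.14. Let genfromtxt infer the column types and
# decode the byte stream itself, so no string length has to be chosen.
decoded = np.genfromtxt(BytesIO(data.encode()),
                        delimiter=',',
                        dtype=None,
                        encoding='utf-8')
```

The explicit dtype truncates strings longer than the stated length, so the width has to cover the longest expected value. The inferred variant avoids that choice at the cost of depending on a newer NumPy.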

Upvotes: 3
