Reputation: 4343
I'm trying to import a tab-delimited text file with string and numeric columns using the numpy.genfromtxt function. Essentially I need an array of strings. Here is a sample file that is giving me trouble:
H2S 1.4
C1 3.6
The file is encoded as Unicode. Here's the code I'm using:
import numpy as np
decodf = lambda x: x.decode('utf-16')
sample = np.genfromtxt(('ztest.txt'), dtype=str,
                       converters={0: decodf, 1: decodf},
                       delimiter='\t',
                       usecols=0)
print(sample)
Here's the output:
['H2S' 'None']
I've tried several ways to fix this issue. Setting dtype=None and removing the converters, I get:
[b'\xff\xfeH\x002\x00S' b'\x00g\x00\xe8\x00n']
I also tried removing the converters and setting dtype=str, and got:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
I understand this is a troublesome function. I saw different options (e.g. here) but couldn't get any of them to work.
What am I doing wrong? In the meantime, I'm looking into Pandas. Thanks in advance.
Upvotes: 0
Views: 1874
Reputation: 114956
Your file is encoded as UTF-16, and the first two bytes (b'\xff\xfe' in your dtype=None output) are the byte order mark (BOM).
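As a quick way to check that diagnosis, here is a minimal sketch that inspects the leading bytes for a UTF-16 BOM. The helper name detect_utf16_bom is hypothetical, and the sample bytes are modelled on the dtype=None output in the question rather than read from the actual file.

```python
import codecs

def detect_utf16_bom(raw):
    """Return the UTF-16 codec implied by a leading BOM, or None (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return 'utf-16-le'
    if raw.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return 'utf-16-be'
    return None

# Assumed sample: the bytes of 'H2S' as UTF-16 little-endian, preceded by the BOM.
raw = b'\xff\xfeH\x002\x00S\x00'
codec = detect_utf16_bom(raw)
text = raw.decode('utf-16')   # the generic 'utf-16' codec consumes the BOM
print(codec, repr(text))
```

The same check can be run on the real file by reading its first few bytes in binary mode, for example open('ztest.txt', 'rb').read(2).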
Try this (with Python 2.7):
import io
import numpy as np
with io.open('ztest.txt', 'r', encoding='UTF-16') as f:
    data = np.genfromtxt(f, delimiter='\t', dtype=None, usecols=[0])  # or dtype=str
genfromtxt has some issues when run in Python 3 with Unicode files. As a workaround, you can encode the lines before passing them to genfromtxt. For example, the following encodes each line as latin-1 before passing the lines to genfromtxt:
import io
import numpy as np
with io.open('ztest.txt', 'r', encoding='UTF-16') as f:
    lines = [line.encode('latin-1') for line in f]
    data = np.genfromtxt(lines, delimiter='\t', dtype=None, usecols=[0])
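On NumPy 1.14 and later, genfromtxt also accepts an encoding argument, so in Python 3 the re-encoding step may not be needed. The following is a sketch under that assumption, not part of the original workaround. The helper name read_first_column is hypothetical, and the temporary file only recreates the two-line sample from the question so the snippet runs on its own.

```python
import os
import tempfile

import numpy as np

def read_first_column(path, encoding='utf-16'):
    """Read column 0 of a tab-delimited file as strings (hypothetical helper).

    Assumes NumPy >= 1.14, where genfromtxt takes an `encoding` argument.
    """
    return np.genfromtxt(path, delimiter='\t', dtype=str,
                         usecols=0, encoding=encoding)

# Recreate the sample from the question as a UTF-16 file (the BOM is written automatically).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ztest.txt')
    with open(path, 'w', encoding='utf-16') as f:
        f.write('H2S\t1.4\nC1\t3.6\n')
    sample = read_first_column(path)

print(sample)
```

Passing the encoding directly lets NumPy open and decode the file itself, so the BOM is consumed by the codec instead of ending up in the first field.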
Upvotes: 1