Analytical360

Reputation: 45

numpy.genfromtxt csv file with null characters

I'm working on a scientific graphing script, designed to create graphs from csv files output by Agilent's Chemstation software.

I got the script working perfectly when the files come from one version of Chemstation (the version for liquid chromatography).

Now I'm trying to port it to work on our GC (gas chromatography). For some reason, this version of Chemstation inserts nulls between the characters of every text file it outputs.

I'm trying to use numpy.genfromtxt to get the x,y data into python in order to create the graphs (using matplotlib).

I originally used:

data = genfromtxt(directory+signal, delimiter = ',') 

to load the data in. When I do this with a csv file generated by our GC, I get an array of all 'nan' values. If I set dtype to None, I get 'byte strings' that look like this:

b'\x00 \x008\x008\x005\x00.\x002\x005\x002\x001\x007\x001\x00\r'

What I need is a float; for the above string it would be 885.252171.

Anyone have any idea how I can get where I need to go?

And just to be clear, I couldn't find any setting in Chemstation that would change its output so that it doesn't create files with nulls.

Thanks

Jeff

Upvotes: 1

Views: 1113

Answers (1)

Warren Weckesser

Reputation: 114781

Given that your file is encoded as utf-16-le with a BOM, and all the actual unicode codepoints (except the BOM) are less than 128, you should be able to use an instance of codecs.EncodedFile to transcode the file from utf-16 to ascii. The following example works for me.

Here's my test file:

$ cat utf_16_le_with_bom.csv 
??2.0,19
1.5,17
2.5,23
1.0,10
3.0,5

The first two bytes, ff and fe, are the BOM U+FEFF in little-endian byte order:

$ hexdump utf_16_le_with_bom.csv 
0000000 ff fe 32 00 2e 00 30 00 2c 00 31 00 39 00 0a 00
0000010 31 00 2e 00 35 00 2c 00 31 00 37 00 0a 00 32 00
0000020 2e 00 35 00 2c 00 32 00 33 00 0a 00 31 00 2e 00
0000030 30 00 2c 00 31 00 30 00 0a 00 33 00 2e 00 30 00
0000040 2c 00 35 00 0a 00                              
0000046
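
For anyone who wants to reproduce the test file, here is a minimal sketch that builds the same bytes in Python. The helper name make_utf16_le_with_bom is my own, purely for illustration; the row contents are taken from the listing above, and the result should be the same 70 bytes (0x46) that hexdump reports.

```python
import codecs


def make_utf16_le_with_bom(text):
    """Return text as UTF-16-LE bytes preceded by the BOM (ff fe).

    Hypothetical helper, for illustration only.
    """
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


rows = "2.0,19\n1.5,17\n2.5,23\n1.0,10\n3.0,5\n"
data = make_utf16_le_with_bom(rows)

# To get a file like utf_16_le_with_bom.csv, write the bytes in binary mode:
# with open("utf_16_le_with_bom.csv", "wb") as f:
#     f.write(data)

print(data[:8].hex())  # BOM first, then '2', '.', '0' each followed by 00
print(len(data))       # total size in bytes
```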

Here's the Python script genfromtxt_utf16.py (updated for Python 3):

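import codecs
import numpy as np

# Open in binary mode. EncodedFile decodes the UTF-16 bytes (consuming
# the BOM) and hands genfromtxt plain ASCII bytes instead.
with open('utf_16_le_with_bom.csv', 'rb') as fh:
    efh = codecs.EncodedFile(fh, data_encoding='ascii', file_encoding='utf-16')
    a = np.genfromtxt(efh, delimiter=',')

print("a:")
print(a)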

With Python 3.4.1 and numpy 1.8.1, the script works:

$ python3.4 genfromtxt_utf16.py 
a:
[[  2.   19. ]
 [  1.5  17. ]
 [  2.5  23. ]
 [  1.   10. ]
 [  3.    5. ]]
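
To see what genfromtxt actually receives, it can help to read from the EncodedFile wrapper directly. The following is only a sketch: it uses an in-memory io.BytesIO as a stand-in for the real file (my assumption, so it runs without the csv on disk), and the sample bytes are just the first two rows of the test file.

```python
import codecs
import io

# BOM followed by two rows, encoded as UTF-16-LE (stand-in for the real file).
raw = codecs.BOM_UTF16_LE + "2.0,19\n1.5,17\n".encode("utf-16-le")

fh = io.BytesIO(raw)
efh = codecs.EncodedFile(fh, data_encoding="ascii", file_encoding="utf-16")

# Reading through the wrapper decodes UTF-16 (dropping the BOM) and
# re-encodes the text as ASCII, so the interleaved nulls are gone.
transcoded = efh.read()
print(transcoded)

# Parse the ASCII rows into floats without numpy, just as a sanity check.
values = [[float(field) for field in line.decode("ascii").split(",")]
          for line in transcoded.splitlines()]
print(values)
```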

Be sure that you don't specify the encoding as file_encoding='utf-16-le'. If the endian suffix is included, the BOM is not stripped; it is decoded as the character U+FEFF, which can't be transcoded to ascii.
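
The difference between the two codec names can be checked with plain bytes.decode, independent of numpy. This is a small sketch with a made-up one-row sample; the variable names are mine.

```python
import codecs

# One row of data in UTF-16-LE, preceded by the BOM (ff fe).
raw = codecs.BOM_UTF16_LE + "2.0,19\n".encode("utf-16-le")

# 'utf-16' reads the BOM to pick the byte order, then discards it.
with_generic = raw.decode("utf-16")

# 'utf-16-le' fixes the byte order up front, so the BOM is kept in the
# decoded text as the character U+FEFF.
with_suffix = raw.decode("utf-16-le")

print(repr(with_generic))
print(repr(with_suffix))

# U+FEFF has no ASCII representation, so re-encoding to ASCII fails.
try:
    with_suffix.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```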

Upvotes: 2
