Reputation: 107

numpy.genfromtxt() can't read header

I am trying to use numpy.genfromtxt() to read a .csv file, but I can't make it read the header correctly.

By default the function doesn't skip the header, but because the values in each column are numbers, it seems to set the type to float for the entire column, at which point it treats the header row as missing values and returns NaN.

Here is my code:

import numpy


dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',')
print(dataset[0:5])

Here are the header and the first 7 data rows of my .csv:

patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,48,14.5,12.6,78.3,1.2
1,48,12.1,5.6,87.1,4.3
1,48,5.6,25.9,72.7,0.4
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
3,59,53.2,1.07,47.8,49.6

And here is the resulting array:

[[ nan  nan  nan  nan  nan  nan]
 [ 1.  48.  35.8  3.4 92.5  3.7]
 [ 1.  48.  14.5 12.6 78.3  1.2]
 [ 1.  48.  12.1  5.6 87.1  4.3]
 [ 1.  48.   5.6 25.9 72.7  0.4]]

Process finished with exit code 0

I tried setting encoding to 'UTF-8-sig' and playing around with other parameters, but to no avail. I also tried numpy.loadtxt(), but it doesn't work for me since there are missing values in the dataset.

The only solution that worked for me was to read the first row into a separate array and then concatenate the two.

Is there a more elegant solution to reading the header as strings while preserving the float nature of the values? I am probably missing something trivial here.

Preferably using numpy or other package – I am not fond of creating for loops everywhere, aka reinventing the wheel while standing at the car park.

Thank you for any and all input.

Upvotes: 0

Answers (2)

Leporello

Reputation: 658

That is feasible with numpy, or even with the standard lib (csv), but I would suggest looking at the pandas package (whose whole point is the handling of CSV-like data).

import pandas as pd

file_to_read = r'path/to/your/csv'

res = pd.read_csv(file_to_read)
print(res)

The "N/A" entries will be read as NaN (for more options, see the na_values and keep_default_na parameters in the documentation for pandas.read_csv).
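As a minimal sketch of how the header and the float values can be pulled apart afterwards: the snippet below reads a few rows copied from the question out of an in-memory string (the io.StringIO wrapper and the variable names are only for illustration; a real run would pass the file path), and passes na_values explicitly even though "N/A" is already among the pandas defaults.

```python
import io

import pandas as pd

# Sample rows copied from the question; a real run would pass a file path instead.
csv_text = """patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,48,14.5,12.6,78.3,1.2
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
"""

# na_values lists "N/A" explicitly (it is also one of pandas' default NA markers).
df = pd.read_csv(io.StringIO(csv_text), na_values=["N/A"])

header = list(df.columns)          # column names as plain strings
values = df.to_numpy(dtype=float)  # 2-D float array, NaN where data was missing

print(header)
print(values)
```

If a plain numpy array is what the rest of the code expects, to_numpy() gives one, while the column names stay available separately in header.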

Upvotes: 2

Ondřej Janča

Reputation: 107

A solution by commenter hpaulj did the job for me:

Using names=True and dtype=None (and possibly encoding=None) should produce a structured array. Look at its shape and dtype. Or use the skip_header parameter, and accept floats.

Also, for anyone starting with numpy and not reading the full documentation, like me: the names of the columns are not stored in the array itself, but in its .dtype.names. And because I didn't look there, I didn't see that the code worked with names=True.

The working code:

import numpy

dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',',
                           encoding='UTF-8-sig',
                           dtype=None,
                           names=True)

print(dataset[0:7])
print(dataset.dtype.names)
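For reference, here is a small self-contained sketch of both options from the comment, run on a few rows copied from the question through an in-memory string (io.StringIO and the variable names are just for illustration). It is a variation on the code above, not the same call: dtype is left at its float default and missing_values="N/A" is passed explicitly, on the assumption that this keeps the marker from influencing type inference and yields NaN in all-float columns.

```python
import io

import numpy as np

# Sample rows copied from the question; a real run would pass a file path instead.
csv_text = """patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,48,14.5,12.6,78.3,1.2
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
"""

# Option 1: names=True takes the field names from the first line and
# returns a 1-D structured array with one float field per column.
structured = np.genfromtxt(io.StringIO(csv_text),
                           delimiter=",",
                           names=True,
                           missing_values="N/A",
                           encoding="utf-8")

print(structured.dtype.names)         # the header lives here, not in the data
print(structured["LYM"])              # select a column by its name
print(np.nanmean(structured["LYM"]))  # NaN-aware mean of that column

# Option 2: skip the header line and accept a plain 2-D float array.
plain = np.genfromtxt(io.StringIO(csv_text),
                      delimiter=",",
                      skip_header=1,
                      missing_values="N/A",
                      encoding="utf-8")

print(plain.shape)
```

Note that the structured array is one-dimensional: each element is a whole row, and columns are selected by field name rather than by a second index.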

Upvotes: 1
