Reputation: 107
I am trying to use numpy.genfromtxt() to read a .csv file, but I can't make it read the header correctly.
By default the function doesn't skip the header, but since the values in each column are numbers, it seems to set the type to float for the entire column, at which point it treats the header row as missing values and returns NaN.
Here is my code:
import numpy
dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',')
print(dataset[0:5])
Here are the first 7 rows of my .csv:
patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,48,14.5,12.6,78.3,1.2
1,48,12.1,5.6,87.1,4.3
1,48,5.6,25.9,72.7,0.4
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
3,59,53.2,1.07,47.8,49.6
And here is the resulting array:
[[ nan nan nan nan nan nan]
[ 1. 48. 35.8 3.4 92.5 3.7]
[ 1. 48. 14.5 12.6 78.3 1.2]
[ 1. 48. 12.1 5.6 87.1 4.3]
[ 1. 48. 5.6 25.9 72.7 0.4]]
I tried setting encoding to 'UTF-8-sig' and playing around with other parameters, but to no avail. I also tried numpy.loadtxt(), but it doesn't work for me since there are missing values in the dataset.
The only solution that worked for me is to read the first row in a separate array and then concatenate them.
Is there a more elegant solution to reading the header as strings while preserving the float nature of the values? I am probably missing something trivial here.
Preferably using numpy or another package, since I am not fond of creating for loops everywhere, aka reinventing the wheel while standing at the car park.
Thank you for any and all input.
Upvotes: 0
Views: 1970
Reputation: 658
That is feasible with numpy, or even with the standard lib (csv), but I would suggest looking at the pandas package (whose whole point is the handling of CSV-like data).
import pandas as pd
file_to_read = r'path/to/your/csv'
res = pd.read_csv(file_to_read)
print(res)
The "N/A" entries will come out as NaN (for more options, see the parameters na_values and keep_default_na in the doc for pandas.read_csv).
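As a minimal sketch of those two parameters, the snippet below reads a few rows copied from the question. The data is inlined through io.StringIO only so the example is self-contained; the variable names CSV_TEXT, df, df_strict and values are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# A few rows copied from the question, inlined so the sketch runs on its own.
CSV_TEXT = """patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
"""

# "N/A" is one of pandas' default missing-value markers, so it becomes NaN.
df = pd.read_csv(io.StringIO(CSV_TEXT))

# More explicit variant: treat only "N/A" as missing and ignore the other defaults.
df_strict = pd.read_csv(
    io.StringIO(CSV_TEXT),
    na_values=["N/A"],
    keep_default_na=False,
)

print(df.columns.tolist())  # the header row is kept as column names
print(df.dtypes)            # the data columns stay numeric

# If a plain NumPy float array is needed downstream:
values = df.to_numpy(dtype=float)
```

The header ends up in df.columns while the values stay numeric, which is the separation the question asks for.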
Upvotes: 2
Reputation: 107
A solution by commenter hpaulj did the job for me:
Using names=True and dtype=None (and possibly encoding=None) should produce a structured array. Look at its shape and dtype. Or use the skip_header parameter, and accept floats.
Also, for anyone starting with numpy and not reading the full documentation, like me: the column names are not stored in the array itself, but in its .dtype.names. Because I didn't look there, I didn't see that the code worked with names=True.
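The skip_header route mentioned in that comment could look roughly like the sketch below. It assumes the same layout as the question's file; the rows are inlined via io.StringIO for the sake of a runnable example, and reading the header with a plain string split is my own addition, not something the comment prescribes.

```python
import io

import numpy as np

# A few rows copied from the question, inlined so the sketch runs on its own.
CSV_TEXT = """patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
"""

# skip_header=1 drops the header line, so every remaining row parses as floats.
# "N/A" cannot be converted to a float and comes out as nan.
data = np.genfromtxt(io.StringIO(CSV_TEXT), delimiter=",", skip_header=1)

# The column names are read separately from the first line (illustrative only).
header = CSV_TEXT.splitlines()[0].split(",")

print(header)
print(data)
```

This gives an ordinary 2-D float array, at the cost of keeping the names in a separate list.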
The working code:
import numpy
dataset = numpy.genfromtxt('datasets/BAL_dataset01.csv',
                           delimiter=',',
                           encoding='UTF-8-sig',
                           dtype=None,
                           names=True)
print(dataset[0:7])
print(dataset.dtype.names)
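For readers new to structured arrays, here is a small sketch of how the result can be used once the names live in .dtype.names. The rows are inlined through io.StringIO so the example is self-contained. Passing missing_values="N/A" is my own addition and not part of the code above; as far as I can tell, with dtype=None a column that contains the literal N/A may otherwise be inferred as a string column rather than floats.

```python
import io

import numpy as np

# A few rows copied from the question, inlined so the sketch runs on its own.
CSV_TEXT = """patient_nr,Age,Native_CD45,LYM,Macr,NEU
1,48,35.8,3.4,92.5,3.7
1,49,13.2,N/A,N/A,N/A
2,18,43.0,17.9,76.2,4.2
"""

dataset = np.genfromtxt(
    io.StringIO(CSV_TEXT),
    delimiter=",",
    dtype=None,            # infer a type per column
    names=True,            # take field names from the first line
    missing_values="N/A",  # assumption: mark "N/A" as missing in every column
    encoding="utf-8",
)

print(dataset.dtype.names)  # column names live on the dtype, not in the data
print(dataset["Age"])       # one column, selected by field name
print(dataset[0])           # one row, as a record

# Stack the fields into an ordinary 2-D float array when plain numbers are needed.
as_floats = np.column_stack([dataset[name] for name in dataset.dtype.names])
```

Note that the structured array is one-dimensional (one record per row), which is why its shape looks different from the float array in the question.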
Upvotes: 1