Reputation: 25366
Here is my test_data.csv:
A,1,2,3,4,5
B,6,7,8,9,10
C,11,12,13,14,15
A,16,17,18,19,20
And I am reading it into a numpy array using the code below:
import csv
import numpy

def readCSVToNumpyArray(dataset):
    with open(dataset) as f:
        values = [i for i in csv.reader(f)]
    data = numpy.array(values)
    return data
In the main code, I have:
numpyArray = readCSVToNumpyArray('test_data.csv')
print(numpyArray)
which gives me the output:
array([['A', '1', '2', '3', '4', '5'],
       ['B', '6', '7', '8', '9', '10'],
       ['C', '11', '12', '13', '14', '15'],
       ['A', '16', '17', '18', '19', '20']],
      dtype='|S2')
But all the numbers in the array are treated as strings. Is there a good way to store them as floats without going through each element and assigning the type?
Thanks!
Upvotes: 2
Views: 1701
Reputation: 1586
I'd read it in using pandas, which lets you set the dtype per column very easily.
import numpy as np
import pandas as pd

pdDF = pd.read_csv(
    'test_data.csv',
    header=None,
    names=list('abcdef'),
    dtype=dict(zip(list('abcdef'), [str] + [float] * 5)))
Now each column has the appropriate dtype:
pdDF.b
Out[24]:
0     1
1     6
2    11
3    16
Name: b, dtype: float64
If you still want it in a numpy array, you can just take .values:
npArr = pdDF.values
npArr
Out[27]:
array([['A', 1.0, 2.0, 3.0, 4.0, 5.0],
       ['B', 6.0, 7.0, 8.0, 9.0, 10.0],
       ['C', 11.0, 12.0, 13.0, 14.0, 15.0],
       ['A', 16.0, 17.0, 18.0, 19.0, 20.0]], dtype=object)
The array's dtype is still object, because you can't make 'A' into a float, but the individual numeric values are floats as desired.
type(npArr[0,1])
Out[28]: float
Finally, if you want just an array of floats, that's also easy enough: take all but the first column as an array and it will have dtype float64 instead of object.
pdDF.loc[:,pdDF.columns>='b'].values
Out[28]:
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.],
       [ 11.,  12.,  13.,  14.,  15.],
       [ 16.,  17.,  18.,  19.,  20.]])
pdDF.loc[:,pdDF.columns>='b'].values.dtype
Out[29]: dtype('float64')
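(As an aside, on pandas 0.24 and newer the recommended spelling of .values is .to_numpy(); a minimal sketch, assuming the same pdDF as above:)
pdDF.loc[:, pdDF.columns >= 'b'].to_numpy()   # same float64 array as .values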
Upvotes: 1
Reputation: 231385
np.genfromtxt can easily load your data into a structured array. It will be a 1d array, with a field for each column.
Simulate the file with a list of lines:
In [265]: txt=b"""A,1,2,3,4,5
.....: B,6,7,8,9,10
.....: C,11,12,13,14,15
.....: A,16,17,18,19,20"""
In [266]: txt=txt.splitlines()
In [267]: A=np.genfromtxt(txt,delimiter=',',names=None,dtype=None)
In [268]: A
Out[268]:
array([(b'A', 1, 2, 3, 4, 5), (b'B', 6, 7, 8, 9, 10),
       (b'C', 11, 12, 13, 14, 15), (b'A', 16, 17, 18, 19, 20)],
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4')])
It deduced the dtype from the column values - strings and ints. Fields are accessed by name:
In [269]: A['f0']
Out[269]:
array([b'A', b'B', b'C', b'A'],
      dtype='|S1')
In [270]: A['f1']
Out[270]: array([ 1, 6, 11, 16])
I could also define a dtype that puts the strings in one field, and all the other values in another field:
In [271]: A=np.genfromtxt(txt,delimiter=',',names=None,dtype='S2,(5)int')
In [272]: A
Out[272]:
array([(b'A', [1, 2, 3, 4, 5]), (b'B', [6, 7, 8, 9, 10]),
       (b'C', [11, 12, 13, 14, 15]), (b'A', [16, 17, 18, 19, 20])],
      dtype=[('f0', 'S2'), ('f1', '<i4', (5,))])
In [273]: A['f1']
Out[273]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])
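Since you want floats rather than ints, the same compound-dtype trick should work with a float field; a minimal sketch, assuming the same txt as above:
A = np.genfromtxt(txt, delimiter=',', dtype='S2,(5)float')
A['f1'].dtype     # dtype('float64')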
Upvotes: 0
Reputation: 13465
Since the first item on each line is a string, you'll have to use numpy's more flexible "object" dtype. Try this function and see if it's what you are looking for:
import csv
import numpy

def readCSVToNumpyArray(dataset):
    values = [[]]
    with open(dataset) as f:
        counter = 0
        for i in csv.reader(f):
            for j in i:
                try:
                    # numeric fields become floats
                    values[counter].append(float(j))
                except ValueError:
                    # non-numeric fields stay as strings
                    values[counter].append(j)
            counter = counter + 1
            values.append([])
    data = numpy.array(values[:-1], dtype='object')
    return data
numpyArray = readCSVToNumpyArray('test_data.csv')
print(numpyArray)
The results are:
[['A' 1.0 2.0 3.0 4.0 5.0]
 ['B' 6.0 7.0 8.0 9.0 10.0]
 ['C' 11.0 12.0 13.0 14.0 15.0]
 ['A' 16.0 17.0 18.0 19.0 20.0]]
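If you later need just the numbers as a plain float array, a minimal sketch (assuming the numpyArray built above) is to slice off the first column and cast:
numericPart = numpyArray[:, 1:].astype(float)
numericPart.dtype    # dtype('float64')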
Upvotes: 2