Edamame

Reputation: 25366

python - numpy: read csv into numpy with proper value type

Here is my test_data.csv:

A,1,2,3,4,5
B,6,7,8,9,10
C,11,12,13,14,15
A,16,17,18,19,20

And I am reading it into a numpy array using the code below:

import csv
import numpy

def readCSVToNumpyArray(dataset):
    with open(dataset) as f:
        values = [i for i in csv.reader(f)]

    data = numpy.array(values)

    return data

In the main code, I have:

    numpyArray = readCSVToNumpyArray('test_data.csv')
    print(numpyArray)

which gives me the output:

(array([['A', '1', '2', '3', '4', '5'],
       ['B', '6', '7', '8', '9', '10'],
       ['C', '11', '12', '13', '14', '15'],
       ['A', '16', '17', '18', '19', '20']], 
      dtype='|S2'))

But all the numbers in the array are treated as strings. Is there a good way to store them as floats without going through each element and assigning the type?

Thanks!

Upvotes: 2

Views: 1701

Answers (3)

kmh

Reputation: 1586

I'd read it in using Pandas, which lets you set the dtype per column very easily.

import numpy as np 
import pandas as pd 

pdDF = pd.read_csv(
    'test_data.csv',
    header=None,
    names=list('abcdef'),
    dtype=dict(zip(list('abcdef'), [str] + [float]*5)))  # column 'a' as str, 'b'-'f' as float

Now each column will have the appropriate dtype.

pdDF.b
Out[24]: 
0     1
1     6
2    11
3    16
Name: b, dtype: float64
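
If you want to verify all the column types in one go, you can inspect the frame built above (with dtype=str the first column comes back as object):

pdDF.dtypes   # 'a' should show object, 'b' through 'f' float64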

If you still want it in a numpy array, you can just take .values.

npArr = pdDF.values

npArr
Out[27]: 
array([['A', 1.0, 2.0, 3.0, 4.0, 5.0],
       ['B', 6.0, 7.0, 8.0, 9.0, 10.0],
       ['C', 11.0, 12.0, 13.0, 14.0, 15.0],
       ['A', 16.0, 17.0, 18.0, 19.0, 20.0]], dtype=object)

The overall dtype is still object, because 'A' can't be made into a float, but the individual numeric values are stored as floats, as desired.

type(npArr[0,1])
Out[28]: float

Finally, if you want just an array of floats, that's also easy enough: just take all but the first column as an array, and it will have dtype float instead of object.

pdDF.loc[:,pdDF.columns>='b'].values
Out[28]: 
array([[  1.,   2.,   3.,   4.,   5.],
       [  6.,   7.,   8.,   9.,  10.],
       [ 11.,  12.,  13.,  14.,  15.],
       [ 16.,  17.,  18.,  19.,  20.]])

pdDF.loc[:,pdDF.columns>='b'].values.dtype
Out[29]: dtype('float64')
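
The name comparison pdDF.columns>='b' works here because the columns are single letters in order; as a sketch of a positional alternative (my addition, not part of the original answer), iloc drops the first column the same way:

pdDF.iloc[:, 1:].values   # same float64 array, selected by position rather than by name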

Upvotes: 1

hpaulj

Reputation: 231385

np.genfromtxt can easily load your data into a structured array. It will be a 1d array, with a field for each column:

Simulate the file with a list of lines:

In [265]: txt=b"""A,1,2,3,4,5
   .....: B,6,7,8,9,10
   .....: C,11,12,13,14,15
   .....: A,16,17,18,19,20"""
In [266]: txt=txt.splitlines()
In [267]: A=np.genfromtxt(txt,delimiter=',',names=None,dtype=None)
In [268]: A
Out[268]: 
array([(b'A', 1, 2, 3, 4, 5), (b'B', 6, 7, 8, 9, 10),
       (b'C', 11, 12, 13, 14, 15), (b'A', 16, 17, 18, 19, 20)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<i4')])

It deduced the dtype from the column values (strings and ints). Fields are accessed by name:

In [269]: A['f0']
Out[269]: 
array([b'A', b'B', b'C', b'A'], 
      dtype='|S1')
In [270]: A['f1']
Out[270]: array([ 1,  6, 11, 16])

I could also define a dtype that would put the strings in one field, and all the other values in another field.

In [271]: A=np.genfromtxt(txt,delimiter=',',names=None,dtype='S2,(5)int')
In [272]: A
Out[272]: 
array([(b'A', [1, 2, 3, 4, 5]), (b'B', [6, 7, 8, 9, 10]),
       (b'C', [11, 12, 13, 14, 15]), (b'A', [16, 17, 18, 19, 20])], 
      dtype=[('f0', 'S2'), ('f1', '<i4', (5,))])
In [273]: A['f1']
Out[273]: 
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])
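
Since the question asks for floats rather than ints, a small variation (my addition, following the same pattern): request the numeric field as float in the dtype string, or cast the int field afterwards.

A = np.genfromtxt(txt, delimiter=',', names=None, dtype='S2,(5)float')  # 'f1' field is now float64
floats = A['f1'].astype(float)                                          # or cast the int field from the earlier call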

Upvotes: 0

armatita

Reputation: 13465

Since the first value in each line is a string, you'll have to use a more flexible numpy dtype called "object". Try this function and see if it is what you are looking for:

    import csv
    import numpy

    def readCSVToNumpyArray(dataset):
        values = [[]]
        with open(dataset) as f:
            counter = 0
            for i in csv.reader(f):
                for j in i:
                    # store as float when possible, otherwise keep the original string
                    try:
                        values[counter].append(float(j))
                    except ValueError:
                        values[counter].append(j)
                counter = counter + 1
                values.append([])

        # drop the trailing empty list and build an object array (mixed strings/floats)
        data = numpy.array(values[:-1],dtype='object')

        return data

    numpyArray = readCSVToNumpyArray('test_data.csv')
    print(numpyArray)

The results are:

    [['A' 1.0 2.0 3.0 4.0 5.0]
     ['B' 6.0 7.0 8.0 9.0 10.0]
     ['C' 11.0 12.0 13.0 14.0 15.0]
     ['A' 16.0 17.0 18.0 19.0 20.0]]
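
If you then need a plain float array for numeric work (my addition, assuming the layout above with the label in the first column), slice it off and cast:

    numbers = numpyArray[:, 1:].astype(float)   # shape (4, 5), dtype float64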

Upvotes: 2
