Reputation: 396
I have a CSV dataset that looks like this:
FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark
I'm trying to load it into Python 3.6 with this code:
>>> from numpy import genfromtxt
>>> my_data = genfromtxt('first.csv', delimiter=',')
>>> print(my_data)
Output:
[[            nan             nan             nan             nan             nan             nan]
 [ 4.10000000e+01  4.10000000e+01             nan             nan  1.13764000e+05             nan]
 [ 5.30000000e+01  4.30000000e+01             nan             nan  1.45963000e+05             nan]
 ...,
 [ 2.10000000e+01  3.00000000e+01             nan             nan  1.19929000e+05             nan]
 [ 6.90000000e+01  6.40000000e+01             nan             nan  1.52667000e+05             nan]
 [ 2.00000000e+01  1.90000000e+01             nan             nan  1.05077000e+05             nan]]
I've looked at the Numpy docs and I don't see anything about this.
Upvotes: 0
Views: 259
Reputation: 231738
With a few general parameters, genfromtxt
can read this file (in Python 3 here):
In [100]: data = np.genfromtxt('stack43444219.txt', delimiter=',', names=True, dtype=None)
In [101]: data
Out[101]:
array([(41, 41, b'USA', b'UK', 113764, b'John'),
(53, 43, b'USA', b'USA', 145963, b'Fred'),
(47, 37, b'USA', b'UK', 42857, b'Dan'),
(47, 44, b'UK', b'USA', 95352, b'Mark')],
dtype=[('FirstAge', '<i4'), ('SecondAge', '<i4'), ('FirstCountry', 'S3'), ('SecondCountry', 'S3'), ('Income', '<i4'), ('NAME', 'S4')])
This is a structured array: two integer fields, two string fields (byte strings by default), another integer, and a string.
By default genfromtxt
reads all lines as data. I used names=True
to use the first line as field names.
It also tries to read all strings as float (the default dtype), so the string columns load as nan
.
All of this is in the genfromtxt
docs. Admittedly they are long, but they aren't hard to find.
Access fields by name, data['FirstAge']
etc.
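A quick sketch of that field access, using a StringIO stand-in for the file so it runs self-contained (newer NumPy may warn about reading strings without an encoding argument; the byte strings match the output above):

```python
import numpy as np
from io import StringIO

csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark"""

data = np.genfromtxt(StringIO(csv_text), delimiter=',', names=True, dtype=None)

print(data['FirstAge'])        # integer field, one value per row
print(data['Income'].mean())   # numeric fields support the usual array math
```

Each field behaves like an ordinary 1d array, so per-column statistics come for free.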
Using the csv
reader gives a 2d array of strings:
In [102]: ls = list(csv.reader(open('stack43444219.txt','r')))
In [103]: ls
Out[103]:
[['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income', 'NAME'],
['41', '41', 'USA', 'UK', '113764', 'John'],
['53', '43', 'USA', 'USA', '145963', 'Fred'],
['47', '37', 'USA', 'UK', '42857', 'Dan'],
['47', '44', 'UK', 'USA', '95352', 'Mark']]
In [104]: arr=np.array(ls)
In [105]: arr
Out[105]:
array([['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income',
'NAME'],
['41', '41', 'USA', 'UK', '113764', 'John'],
['53', '43', 'USA', 'USA', '145963', 'Fred'],
['47', '37', 'USA', 'UK', '42857', 'Dan'],
['47', '44', 'UK', 'USA', '95352', 'Mark']],
dtype='<U13')
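If you take this route, the numeric columns still have to be cast explicitly. A sketch, assuming the 2d string array built above:

```python
import numpy as np

# 2d string array as produced by csv.reader + np.array above
arr = np.array([
    ['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income', 'NAME'],
    ['41', '41', 'USA', 'UK', '113764', 'John'],
    ['53', '43', 'USA', 'USA', '145963', 'Fred'],
    ['47', '37', 'USA', 'UK', '42857', 'Dan'],
    ['47', '44', 'UK', 'USA', '95352', 'Mark'],
])

header, body = arr[0], arr[1:]       # split off the column names
ages = body[:, :2].astype(int)       # first two columns are integers
income = body[:, 4].astype(int)      # Income column
print(ages)
print(income)
```

astype works column-slice by column-slice here because each slice is homogeneous, which the full array is not.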
Upvotes: 1
Reputation: 1853
I think an issue you could be running into is that the data you are trying to parse is not all numeric, which can cause unexpected behavior.
One way to handle this is to check each value's type before it is added to your array. For example:
for row in my_data:
    for obj in row:
        if isinstance(obj, int):
            pass  # process or add the value to your NumPy array
        else:
            pass  # cast or discard the value
Upvotes: -1
Reputation: 5389
An alternative to using pandas
is to use the csv
library:
import csv
import numpy as np
ls = list(csv.reader(open('first.csv', 'r')))
val_array = np.array(ls)[1:]  # exclude the first row (column names)
Upvotes: 1
Reputation: 244301
You could use the dtype
argument:
import numpy as np
output = np.genfromtxt("main.csv", delimiter=',', skip_header=1, dtype='f, f, |S6, |S6, f, |S6')
print(output)
Output:
[( 41., 41., b'USA', b'UK', 113764., b'John')
( 53., 43., b'USA', b'USA', 145963., b'Fred')
( 47., 37., b'USA', b'UK', 42857., b'Dan')
( 47., 44., b'UK', b'USA', 95352., b'Mark')]
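With a plain dtype string like this (no names= argument), the fields get NumPy's default names f0 through f5. A sketch of accessing them, again with a StringIO stand-in for the file:

```python
import numpy as np
from io import StringIO

csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark"""

output = np.genfromtxt(StringIO(csv_text), delimiter=',', skip_header=1,
                       dtype='f, f, |S6, |S6, f, |S6')

print(output['f0'])   # FirstAge column, as float32
print(output['f5'])   # NAME column, as byte strings like b'John'
```

If you want meaningful field names instead of f0..f5, pass names=True and drop skip_header=1.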
Upvotes: 1
Reputation: 27899
Go with pandas
, it will save you the trouble:
import pandas as pd
df = pd.read_csv('first.csv')
print(df)
Upvotes: 2
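To see what this buys you: read_csv infers a sensible dtype per column, so the mixed file that gave genfromtxt nans works out of the box. A sketch using a StringIO stand-in for first.csv:

```python
import pandas as pd
from io import StringIO

csv_text = """FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark"""

df = pd.read_csv(StringIO(csv_text))
print(df.dtypes)             # numeric columns inferred as int64, strings as object
print(df['Income'].mean())   # columns support numeric operations directly
```

df.values (or df.to_numpy()) gets you back to a NumPy array if the rest of your pipeline needs one.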