AlanGhalan

Reputation: 396

Importing CSV into Python

I have a CSV dataset that looks like this:

FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME
41,41,USA,UK,113764,John
53,43,USA,USA,145963,Fred
47,37,USA,UK,42857,Dan
47,44,UK,USA,95352,Mark

I'm trying to load it into Python 3.6 with this code:

>>> from numpy import genfromtxt

>>> my_data = genfromtxt('first.csv', delimiter=',')
>>> print(my_data)

Output:

 [[             nan              nan              nan              nan
               nan              nan]
 [  4.10000000e+01   4.10000000e+01              nan              nan
    1.13764000e+05              nan]
 [  5.30000000e+01   4.30000000e+01              nan              nan
    1.45963000e+05              nan]
 ..., 
 [  2.10000000e+01   3.00000000e+01              nan              nan
    1.19929000e+05              nan]
 [  6.90000000e+01   6.40000000e+01              nan              nan
    1.52667000e+05              nan]
 [  2.00000000e+01   1.90000000e+01              nan              nan
    1.05077000e+05              nan]]

I've looked at the NumPy docs and I don't see anything that explains this.

Upvotes: 0

Views: 259

Answers (5)

hpaulj

Reputation: 231738

With a few general parameters, genfromtxt can read this file (in PY3 here):

In [100]: data = np.genfromtxt('stack43444219.txt', delimiter=',', names=True, dtype=None)
In [101]: data
Out[101]: 
array([(41, 41, b'USA', b'UK', 113764, b'John'),
       (53, 43, b'USA', b'USA', 145963, b'Fred'),
       (47, 37, b'USA', b'UK',  42857, b'Dan'),
       (47, 44, b'UK', b'USA',  95352, b'Mark')], 
      dtype=[('FirstAge', '<i4'), ('SecondAge', '<i4'), ('FirstCountry', 'S3'), ('SecondCountry', 'S3'), ('Income', '<i4'), ('NAME', 'S4')])

This is a structured array with six fields: two integers, two strings (byte strings by default), another integer, and another string.

By default, genfromtxt reads every line as data. I used names=True so that the first line supplies the field names.

It also tries to parse every value as a float (the default dtype), so the string columns load as nan.
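
A minimal sketch of that default behaviour, assuming the first rows from the question are held in an in-memory string (`sample`, a stand-in for the file) rather than read from first.csv:

```python
import io
import numpy as np

# Stand-in for first.csv: the header and the first two rows from the question.
sample = (
    "FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME\n"
    "41,41,USA,UK,113764,John\n"
    "53,43,USA,USA,145963,Fred\n"
)

# No names= and no dtype=: the header is read as data and every value
# is parsed as a float.
raw = np.genfromtxt(io.StringIO(sample), delimiter=',')

print(raw.shape)          # one row per line, header included
print(np.isnan(raw[0]))   # header row: no cell parses as a float
print(np.isnan(raw[1]))   # data row: only the text columns fail
```

The numeric columns survive, which matches the pattern of nan values in the question's output.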

All of this is in the genfromtxt docs. Admittedly they are long, but they aren't hard to find.

Access fields by name, e.g. data['FirstAge'].
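
As a rough illustration of field access on the structured array, here is a sketch that assumes the question's four rows are in an in-memory string (`sample`) and adds encoding='utf-8', which is not in the call above, so that text fields come back as str rather than bytes:

```python
import io
import numpy as np

# Stand-in for the CSV file: the four rows shown in the question.
sample = (
    "FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME\n"
    "41,41,USA,UK,113764,John\n"
    "53,43,USA,USA,145963,Fred\n"
    "47,37,USA,UK,42857,Dan\n"
    "47,44,UK,USA,95352,Mark\n"
)

# encoding='utf-8' is an addition here; it yields str fields instead of bytes.
data = np.genfromtxt(io.StringIO(sample), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

ages = data['FirstAge']                         # one column as an integer array
names = data['NAME']                            # one column as a string array
uk_first = data[data['FirstCountry'] == 'UK']   # rows selected by a field value
mean_income = data['Income'].mean()             # ordinary array math on a field

print(ages, names, uk_first, mean_income)
```

Each field behaves like a regular 1-D array, so boolean masks and reductions work per column.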


Using the csv reader gives a 2D array of strings:

In [102]: ls =list(csv.reader(open('stack43444219.txt','r')))
In [103]: ls
Out[103]: 
[['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income', 'NAME'],
 ['41', '41', 'USA', 'UK', '113764', 'John'],
 ['53', '43', 'USA', 'USA', '145963', 'Fred'],
 ['47', '37', 'USA', 'UK', '42857', 'Dan'],
 ['47', '44', 'UK', 'USA', '95352', 'Mark']]
In [104]: arr=np.array(ls)
In [105]: arr
Out[105]: 
array([['FirstAge', 'SecondAge', 'FirstCountry', 'SecondCountry', 'Income',
        'NAME'],
       ['41', '41', 'USA', 'UK', '113764', 'John'],
       ['53', '43', 'USA', 'USA', '145963', 'Fred'],
       ['47', '37', 'USA', 'UK', '42857', 'Dan'],
       ['47', '44', 'UK', 'USA', '95352', 'Mark']], 
      dtype='<U13')
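
Because every cell in that array is a string, numeric columns have to be converted explicitly. A small sketch, assuming the data is in an in-memory string (`sample`) and that the header row is kept separate from the body:

```python
import csv
import io
import numpy as np

# Stand-in for the CSV file: the header and the first two rows from the question.
sample = (
    "FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME\n"
    "41,41,USA,UK,113764,John\n"
    "53,43,USA,USA,145963,Fred\n"
)

rows = list(csv.reader(io.StringIO(sample)))
header = rows[0]              # column names, kept as a plain list
body = np.array(rows[1:])     # 2D array of strings, header excluded

# Look the column up by name, then convert its strings to integers.
income = body[:, header.index('Income')].astype(int)

print(income)
```

This keeps the text columns untouched and only converts the columns that are needed as numbers.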

Upvotes: 1

AgnosticDev

Reputation: 1853

I think the issue you are running into is that not all of the data you are trying to parse is numeric, which can cause unexpected behavior.

One way to handle this is to check the type of each value before it is added to your array. For example:

for obj in my_data:
    if isinstance(obj, int):
        pass  # process or add your data to numpy
    else:
        pass  # cast or discard the data
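
One possible way to flesh out that per-value check, sketched with the csv module rather than NumPy. The helper name `parse_cell` and the in-memory `sample` string are hypothetical and used only for illustration:

```python
import csv
import io

def parse_cell(text):
    """Return an int or float if the text parses as one, else the stripped text."""
    text = text.strip()
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

# Stand-in for the CSV file: the header and the first two rows from the question.
sample = (
    "FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME\n"
    "41,41,USA,UK,113764,John\n"
    "53,43,USA,USA,145963,Fred\n"
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)
records = [[parse_cell(cell) for cell in row] for row in reader]

# Columns whose first value parsed as a number.
numeric_cols = [name for name, cell in zip(header, records[0])
                if isinstance(cell, (int, float))]

print(records)
print(numeric_cols)
```

Note that float() also accepts strings such as "nan" and "inf", so a stricter check may be needed for real data.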

Upvotes: -1

titipata

Reputation: 5389

An alternative to pandas is the csv library:

import csv
import numpy as np

# Open the file in a with-block so it is closed after reading.
with open('first.csv', 'r', newline='') as f:
    ls = list(csv.reader(f))
val_array = np.array(ls)[1:]  # exclude the first row (column names)

Upvotes: 1

eyllanesc

Reputation: 244301

You could use the dtype argument:

import numpy as np

output = np.genfromtxt("main.csv", delimiter=',', skip_header=1, dtype='f, f, |S6, |S6, f, |S6')

print(output)

Output:

[( 41.,  41., b'USA', b'UK',  113764., b'John')
 ( 53.,  43., b'USA', b'USA',  145963., b'Fred')
 ( 47.,  37., b'USA', b'UK',   42857., b'Dan')
 ( 47.,  44., b'UK', b'USA',   95352., b'Mark')]

Upvotes: 1

zipa

Reputation: 27899

Go with pandas; it will save you the trouble:

import pandas as pd

df = pd.read_csv('first.csv')
print(df)
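
If a NumPy array is still needed afterwards, the numeric columns can be pulled out of the DataFrame. A minimal sketch, assuming the data is in an in-memory string (`sample`) instead of first.csv:

```python
import io
import pandas as pd

# Stand-in for first.csv: the header and the first two rows from the question.
sample = (
    "FirstAge,SecondAge,FirstCountry,SecondCountry,Income,NAME\n"
    "41,41,USA,UK,113764,John\n"
    "53,43,USA,USA,145963,Fred\n"
)

# read_csv takes the header from the first line and infers a type per column.
df = pd.read_csv(io.StringIO(sample))

# Select the numeric columns by name and convert them to a 2D NumPy array.
numeric = df[['FirstAge', 'SecondAge', 'Income']].to_numpy()

print(df.dtypes)
print(numeric)
```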

Upvotes: 2
