hatmatrix

Reputation: 44862

How do I read CSV data into a record array in NumPy?

Is there a direct way to import the contents of a CSV file into a record array, just like how R's read.table(), read.delim(), and read.csv() import data into R dataframes?

Or should I use csv.reader() and then apply numpy.core.records.fromrecords()?

Upvotes: 575

Views: 1288039

Answers (14)

Ovu Sunday

Reputation: 9

This is a very simple task. The best way to do it is as follows:

import pandas as pd
import numpy as np


# Read the file. The r prefix makes the path a raw string, so the
# backslashes are not treated as escape characters. Remember to include
# the file name and its ".csv" extension at the end of the path.
df = pd.read_csv(r'C:\Users\Ron\Desktop\Clients.csv')

print(df)

y = np.array(df)

Upvotes: -1

Lee

Reputation: 31040

Use pandas.read_csv:

import pandas as pd
df = pd.read_csv('myfile.csv', sep=',', header=None)
print(df.values)
array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

This gives a pandas DataFrame which provides many useful data manipulation functions which are not directly available with numpy record arrays.

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table...


I would also recommend numpy.genfromtxt. Because the question asks for a record array rather than a plain array, the dtype=None parameter needs to be added to the genfromtxt call. Without it, the basic call is:

import numpy as np
np.genfromtxt('myfile.csv', delimiter=',')

For the following 'myfile.csv':

1.0, 2, 3
4, 5.5, 6

the code above gives an array:

array([[ 1. ,  2. ,  3. ],
       [ 4. ,  5.5,  6. ]])

and

np.genfromtxt('myfile.csv', delimiter=',', dtype=None)

gives a record array (strictly speaking, a structured array with named fields):

array([(1.0, 2.0, 3), (4.0, 5.5, 6)], 
      dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<i4')])

This has the advantage that files with multiple data types (including strings) can be easily imported.
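As a minimal sketch of the mixed-type case, the snippet below parses a small in-memory CSV with a header row. The sample data and the column names (city, temp, visits) are made up for illustration. The names=True and encoding arguments are additions not used in the answer above.

```python
import io
import numpy as np

# Hypothetical sample data: one string column, one float column, one int column.
sample = io.StringIO("city,temp,visits\nOslo,1.5,3\nLima,22.0,7\n")

# dtype=None lets genfromtxt guess a type per column;
# names=True takes the field names from the header row.
arr = np.genfromtxt(sample, delimiter=",", dtype=None, names=True, encoding="utf-8")

print(arr.dtype.names)   # field names taken from the header
print(arr["temp"])       # one column, selected by field name
print(arr[0])            # one row, as a record
```

Each column keeps its own type, so the string column does not force the numeric columns to become strings.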

Upvotes: 249

Andrew

Reputation: 13191

Use numpy.genfromtxt() by setting the delimiter kwarg to a comma:

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')

Upvotes: 884

William komp

Reputation: 1257

I tried this:

from numpy import genfromtxt
genfromtxt(fname = dest_file, dtype = (<whatever options>))

versus:

import csv
import numpy as np
with open(dest_file,'r') as dest_f:
    data_iter = csv.reader(dest_f,
                           delimiter = delimiter,
                           quotechar = '"')
    data = [row for row in data_iter]  # collect every parsed row into a list
data_array = np.asarray(data, dtype = <whatever options>)

on 4.6 million rows with about 70 columns, and found that the NumPy path took 2 min 16 s while the csv list-comprehension method took 13 s.

I would recommend the csv list-comprehension method, as it most likely relies on pre-compiled libraries and less on the interpreter than NumPy's genfromtxt does. I suspect the pandas method would have similar interpreter overhead.
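These timings depend heavily on the data and on the library versions. Here is a minimal sketch for repeating the comparison on your own machine, assuming a small all-numeric table generated in memory. The array size and variable names are made up for illustration and are far smaller than the 4.6 million rows mentioned above.

```python
import csv
import io
import time

import numpy as np

# Build a hypothetical all-numeric CSV in memory (2000 rows x 10 columns).
rng = np.random.default_rng(0)
buf = io.StringIO()
np.savetxt(buf, rng.random((2000, 10)), delimiter=",")
text = buf.getvalue()

# Path 1: numpy.genfromtxt
start = time.perf_counter()
a = np.genfromtxt(io.StringIO(text), delimiter=",")
t_genfromtxt = time.perf_counter() - start

# Path 2: csv.reader into a list, then one conversion to a float array
start = time.perf_counter()
rows = list(csv.reader(io.StringIO(text)))
b = np.asarray(rows, dtype=float)
t_csv = time.perf_counter() - start

print(f"genfromtxt: {t_genfromtxt:.4f} s, csv + asarray: {t_csv:.4f} s")
```

Both paths should produce the same array, so only the timing differs. Swap in your own file and dtype to get numbers that apply to your data.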

Upvotes: 94

Available in recent versions of pandas and NumPy (DataFrame.to_numpy() was added in pandas 0.24).

import pandas as pd
import numpy as np

data = pd.read_csv('data.csv', header=None)

# Discover, visualize, and preprocess data using pandas if needed.

data = data.to_numpy()
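Since the question asks for a record array rather than a plain array, DataFrame.to_records() may be closer to what is wanted than to_numpy(). A minimal sketch, assuming a small in-memory file with a header row (the sample data and the column names label and value are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical sample data with a header row.
sample = io.StringIO("label,value\na,1.5\nb,2.5\n")
df = pd.read_csv(sample)

# index=False leaves the DataFrame index out of the result.
rec = df.to_records(index=False)

print(rec.dtype.names)  # field names come from the column names
print(rec.value)        # fields are reachable as attributes
print(rec["label"])     # ...or by name
```

Unlike to_numpy(), this keeps a separate type for each column instead of forcing them into one common dtype.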

Upvotes: 6

kdurant

Reputation: 1

In [329]: %time my_data = genfromtxt('one.csv', delimiter=',')
CPU times: user 19.8 s, sys: 4.58 s, total: 24.4 s
Wall time: 24.4 s

In [330]: %time df = pd.read_csv("one.csv", skiprows=20)
CPU times: user 1.06 s, sys: 312 ms, total: 1.38 s
Wall time: 1.38 s

Upvotes: 0

btel

Reputation: 5693

You can also try recfromcsv() which can guess data types and return a properly formatted record array.
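As far as I know, recfromcsv() is no longer in the main NumPy namespace as of NumPy 2.0, where the suggested replacement is genfromtxt with a comma delimiter. A minimal sketch of an approximate equivalent, assuming a small in-memory file (the sample data and column names are made up for illustration):

```python
import io

import numpy as np

# Hypothetical sample data; note the mixed-case header.
sample = io.StringIO("Label,Value\na,1.5\nb,2.5\n")

rec = np.genfromtxt(
    sample,
    delimiter=",",
    names=True,               # take field names from the header row
    dtype=None,               # guess a type per column
    encoding="utf-8",
    case_sensitive="lower",   # lower-case the field names
).view(np.recarray)           # view the structured array as a record array

print(rec.dtype.names)
print(rec.value)              # attribute access, as with a recarray
```

The .view(np.recarray) call is what turns the structured array into a true record array with attribute access to its fields.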

Upvotes: 70

matthewpark319

Reputation: 1263

This is the easiest way:

import csv
with open('testfile.csv', newline='') as csvfile:
    data = list(csv.reader(csvfile))

Now each entry in data is a record, represented as a list of strings. So you have a 2D nested list, which you can pass to NumPy for conversion. It saved me so much time.
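The csv.reader() plus fromrecords() route mentioned in the question can be sketched as follows. This is only an illustration: the sample data and the column names (label, value) are made up, and the conversion of the second column to float is an assumption about the file's contents.

```python
import csv
import io

import numpy as np

# Hypothetical sample data with a header row.
sample = io.StringIO("label,value\na,1.5\nb,2.5\n")

reader = csv.reader(sample)
header = next(reader)  # first row holds the field names

# csv.reader yields strings, so convert numeric fields explicitly.
rows = [(label, float(value)) for label, value in reader]

rec = np.rec.fromrecords(rows, names=header)

print(rec.dtype.names)
print(rec.value)
```

Without the explicit float() conversion, every field would be stored as a string.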

Upvotes: 7

Nihal Sargaiya

Reputation: 79

This works like a charm:

import csv
with open("data.csv", 'r') as f:
    data = list(csv.reader(f, delimiter=";"))

import numpy as np
data = np.array(data, dtype=float)  # np.float was removed in NumPy 1.24; use the built-in float

Upvotes: 7

Jatin Mandav

Reputation: 51

I would suggest using PyTables (pip3 install tables). You can save your .csv file to .h5 using pandas (pip3 install pandas):

import pandas as pd
data = pd.read_csv("dataset.csv")
store = pd.HDFStore('dataset.h5')
store['mydata'] = data
store.close()

You can then load your data into a NumPy array easily, and in less time, even for huge amounts of data:

import pandas as pd
store = pd.HDFStore('dataset.h5')
data = store['mydata']
store.close()

# Data in NumPy format
data = data.values

Upvotes: 6

HVNSweeting

Reputation: 2897

I tried both NumPy and pandas, and found that pandas has several advantages:

  • Faster
  • Less CPU usage
  • 1/3 RAM usage compared to NumPy genfromtxt

This is my test code:

$ for f in test_pandas.py test_numpy_csv.py ; do  /usr/bin/time python $f; done
2.94user 0.41system 0:03.05elapsed 109%CPU (0avgtext+0avgdata 502068maxresident)k
0inputs+24outputs (0major+107147minor)pagefaults 0swaps

23.29user 0.72system 0:23.72elapsed 101%CPU (0avgtext+0avgdata 1680888maxresident)k
0inputs+0outputs (0major+416145minor)pagefaults 0swaps

test_numpy_csv.py

from numpy import genfromtxt
train = genfromtxt('/home/hvn/me/notebook/train.csv', delimiter=',')

test_pandas.py

from pandas import read_csv
df = read_csv('/home/hvn/me/notebook/train.csv')

Data file:

du -h ~/me/notebook/train.csv
 59M    /home/hvn/me/notebook/train.csv

With NumPy and pandas at versions:

$ pip freeze | egrep -i 'pandas|numpy'
numpy==1.13.3
pandas==0.20.2

Upvotes: 27

chamzz.dot

Reputation: 775

You can use this code to send CSV file data into an array:

import numpy as np
data = np.genfromtxt('test.csv', delimiter=",")  # named data to avoid shadowing the csv module
print(data)

Upvotes: 7

Xiaojian Chen

Reputation: 189

Using numpy.loadtxt

This is quite a simple method, but it requires every element to be numeric (float, int, and so on).

import numpy as np 
data = np.loadtxt('c:\\1.csv',delimiter=',',skiprows=0)  
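If the file has a header row, skiprows can be used to step over it. A minimal sketch, assuming a small in-memory file whose contents are made up for illustration:

```python
import io

import numpy as np

# Hypothetical sample data: one header row followed by numeric rows.
sample = io.StringIO("x,y\n1,2.5\n3,4.5\n")

# skiprows=1 skips the header, which loadtxt could not parse as a number.
data = np.loadtxt(sample, delimiter=",", skiprows=1)

print(data.shape)
print(data)
```

The result is a plain float array, not a record array, so the column names in the header are discarded.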

Upvotes: 10

muTheTechie

Reputation: 1691

I tried this:

import pandas as p
import numpy as n

closingValue = p.read_csv("<FILENAME>", usecols=[4], dtype=float)
print(closingValue)

Upvotes: 4
