user2180519

Efficient way to process CSV file into a numpy array

The CSV file may not be clean: some lines have an inconsistent number of elements, and those lines need to be discarded. String manipulation is required during processing.

Example input:

20150701 20:00:15.173,0.5019,0.91665

Desired output: float32 (pseudo-date, seconds in the day, f3, f4)

0.150701 72015.173 0.5019 0.91665 (plus the trailing noise digits that floats usually pick up)
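To make the string manipulation concrete, here is a minimal sketch of the per-line conversion implied by the example above. The function name parse_line is hypothetical, and the pseudo-date rule (drop the century, then scale to 0.YYMMDD) is an assumption inferred from the single example, not a stated specification.

```python
def parse_line(line):
    """Return (pseudo_date, seconds_in_day, f3, f4), or None for an unclean line."""
    fields = line.strip().split(",")
    if len(fields) != 3:  # wrong number of elements: disregard the line
        return None
    try:
        date_part, time_part = fields[0].split(" ")
        hours, minutes, seconds = time_part.split(":")
        # Assumption: 20150701 -> 150701 -> 0.150701
        pseudo_date = int(date_part[2:]) / 1e6
        seconds_in_day = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        return pseudo_date, seconds_in_day, float(fields[1]), float(fields[2])
    except ValueError:
        # Covers malformed timestamps and non-numeric fields
        return None
```

Returning None for unclean lines lets the caller skip them without raising.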

The CSV file is also very big: over 30 GB. The resulting numpy array is expected to take 5-10 GB in memory.

Looking for an efficient way to process the CSV file and end up with a numpy array.

Current solution: use the csv module, process the file line by line, and use a list() as a buffer that is later turned into a numpy array with asarray(). The problem is that memory consumption doubles during the conversion, and the copy adds execution overhead.

Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.

Upvotes: 1

Views: 2432

Answers (3)

maswadkar

Reputation: 1542

Have you considered using pandas read_csv (with engine='c')?

I find it one of the best and easiest solutions for handling CSV. I worked with a 4 GB file and it worked for me.

import pandas as pd
df = pd.read_csv('abc.csv', engine='c')
print(df.head(10))

Upvotes: 1

Phil

Reputation: 6174

I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv function reads the file into a pandas DataFrame. You can then access the underlying numpy array using the to_numpy method of the returned DataFrame (as_matrix in older pandas versions; it was removed in pandas 1.0).

Upvotes: 0

wwii

Reputation: 23773

If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.

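import numpy as np

no_rows = 5      # number of rows, known in advance
no_columns = 4

# float32, as requested in the question
a = np.zeros((no_rows, no_columns), dtype=np.float32)

with open('myfile') as f:
    for i, line in enumerate(f):
        # placeholder: parse one line into no_columns floats
        a[i, :] = cool_function_that_returns_formatted_data(line)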
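If the row count is not known in advance, one option is a two-pass variant of the same idea: count the lines first to get an upper bound, preallocate, then fill in place and trim. The sketch below is only an illustration under assumptions. load_csv and parse_line are hypothetical names, with parse_line standing in for cool_function_that_returns_formatted_data. The pseudo-date rule is inferred from the example in the question.

```python
import numpy as np

def parse_line(line):
    # Hypothetical parser: returns 4 floats, or None for an unclean line.
    fields = line.strip().split(",")
    if len(fields) != 3:
        return None
    try:
        date_part, time_part = fields[0].split(" ")
        h, m, s = time_part.split(":")
        return (int(date_part[2:]) / 1e6,          # assumed pseudo-date rule
                int(h) * 3600 + int(m) * 60 + float(s),
                float(fields[1]), float(fields[2]))
    except ValueError:
        return None

def load_csv(path):
    # Pass 1: the total line count is an upper bound on the valid rows.
    with open(path) as f:
        max_rows = sum(1 for _ in f)
    a = np.empty((max_rows, 4), dtype=np.float32)
    # Pass 2: write valid rows straight into the array, skipping unclean lines.
    n = 0
    with open(path) as f:
        for line in f:
            row = parse_line(line)
            if row is not None:
                a[n] = row
                n += 1
    return a[:n]  # a view of the filled rows; no second copy is made
```

The trade-off is reading the file twice in exchange for never holding a Python list and the array at the same time. The returned slice keeps the full buffer alive, so rows reserved for skipped lines stay allocated.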

Upvotes: 3
