Reputation:
The CSV file may not be clean (some lines have an inconsistent number of fields); unclean lines need to be discarded. Some string manipulation is required during processing.
Example input:
20150701 20:00:15.173,0.5019,0.91665
Desired output: float32 (pseudo-date, seconds in the day, f3, f4)
0.150701 72015.173 0.5019 0.91665 (plus the trailing garbage digits floats usually get)
The CSV file is also very big: it is over 30 GB, and the resulting numpy array is expected to take 5-10 GB in memory.
I am looking for an efficient way to process the CSV file and end up with a numpy array.
Current solution: use the csv module, process the file line by line, and use a list() as a buffer that is later turned into a numpy array with asarray(). The problem is that memory consumption doubles during the conversion, and the copying adds execution overhead.
Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.
Upvotes: 1
Views: 2432
Reputation: 1542
Have you considered using pandas read_csv (with engine='c')?
I find it one of the best and easiest ways to handle CSV files. I worked with a 4 GB file and it worked for me.
import pandas as pd
df = pd.read_csv('abc.csv', engine='c')
print(df.head(10))
Upvotes: 1
Reputation: 6174
I think the I/O capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv function reads the file into a pandas DataFrame. You can then get the underlying numpy array with the to_numpy method of the returned DataFrame (or its values attribute; the older as_matrix method was deprecated and removed in pandas 1.0).
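As a rough sketch of how this could look for the question's data, the file can be read in chunks and each chunk converted to float32 before it is kept. The column names, the chunk size, and the way the pseudo-date is derived (two-digit year, month, and day divided by 1e6) are assumptions made for illustration, not something the question specifies. on_bad_lines='skip' requires pandas 1.3 or newer and drops lines with too many fields; lines with too few fields come through as NaN and are filtered out afterwards.

```python
import io

import numpy as np
import pandas as pd


def csv_to_float32(source, chunksize=1_000_000):
    """Read the CSV in chunks and return an (n, 4) float32 array."""
    reader = pd.read_csv(
        source,
        header=None,
        names=["timestamp", "f3", "f4"],  # hypothetical column names
        dtype={"timestamp": str},
        engine="c",
        on_bad_lines="skip",  # skip lines with too many fields (pandas >= 1.3)
        chunksize=chunksize,
    )
    pieces = []
    for chunk in reader:
        # Unparseable values become NaT/NaN instead of raising.
        ts = pd.to_datetime(
            chunk["timestamp"], format="%Y%m%d %H:%M:%S.%f", errors="coerce"
        )
        f3 = pd.to_numeric(chunk["f3"], errors="coerce")
        f4 = pd.to_numeric(chunk["f4"], errors="coerce")
        # Assumed pseudo-date encoding: 20150701 -> 0.150701
        pseudo_date = (ts.dt.year % 100 * 10000 + ts.dt.month * 100 + ts.dt.day) / 1e6
        seconds = (ts - ts.dt.normalize()).dt.total_seconds()
        block = np.column_stack([pseudo_date, seconds, f3, f4])
        block = block[~np.isnan(block).any(axis=1)]  # discard unclean rows
        pieces.append(block.astype(np.float32))
    if not pieces:
        return np.empty((0, 4), dtype=np.float32)
    return np.concatenate(pieces)
```

Note that the final np.concatenate still copies every chunk once, so peak memory is roughly twice the size of the result. If that is too much, the chunks would have to be written into a preallocated array instead.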
Upvotes: 0
Reputation: 23773
If you know in advance how many rows are in the data, you could dispense with the intermediate list
and write directly to the array.
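import numpy as np
no_rows = 5
no_columns = 4
# np.float was removed from NumPy; float32 matches the desired output type
a = np.zeros((no_rows, no_columns), dtype=np.float32)
with open('myfile') as f:
    for i, line in enumerate(f):
        # placeholder: must return the four values for one line
        a[i, :] = cool_function_that_returns_formatted_data(line)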
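A possible shape for the formatting function, combined with skipping unclean lines, is sketched below. The names parse_line and fill_array are made up for this example, and the pseudo-date rule (drop the century and prefix "0.") is an assumption inferred from the sample in the question. max_rows only has to be an upper bound on the number of valid lines, for example from a quick line count of the file.

```python
import io

import numpy as np


def parse_line(line, n_fields=3):
    """Return (pseudo_date, seconds_in_day, f3, f4), or None for an unclean line."""
    parts = line.strip().split(",")
    if len(parts) != n_fields:
        return None
    try:
        date_str, time_str = parts[0].split(" ")
        if len(date_str) != 8 or not date_str.isdigit():
            return None
        hours, minutes, seconds = time_str.split(":")
        pseudo_date = float("0." + date_str[2:])  # assumed: 20150701 -> 0.150701
        seconds_in_day = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        return pseudo_date, seconds_in_day, float(parts[1]), float(parts[2])
    except ValueError:
        # wrong number of sub-fields or a non-numeric value
        return None


def fill_array(lines, max_rows):
    """Write valid rows straight into a preallocated float32 array."""
    out = np.empty((max_rows, 4), dtype=np.float32)
    n = 0
    for line in lines:
        row = parse_line(line)
        if row is None:
            continue  # disregard unclean lines
        if n >= max_rows:
            raise ValueError("more valid rows than max_rows")
        out[n] = row
        n += 1
    return out[:n]  # a view onto the filled part, no extra copy
```

Because the rows are written directly into the array and the result is a slice of it, there is no intermediate list and no second full-size copy.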
Upvotes: 3