Reputation: 5565
I am currently reading in a large CSV file (around 100 million lines), using code along the lines of that described in https://docs.python.org/2/library/csv.html, e.g.:
import csv
with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        process_row(row)
This is proving rather slow, I suspect because each line is read individually (requiring lots of read calls to the hard drive). Is there any way of reading the whole CSV file in at once and then iterating over it? Although the file itself is large (e.g. 5 GB), my machine has sufficient RAM to hold it in memory.
Upvotes: 0
Views: 439
Reputation: 16
You can also use the chunksize parameter of pandas' read_csv to read the file in pieces and process each chunk:
import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames,
# each holding at most 10000 rows
df = pd.read_csv("path/test.csv", chunksize=10000)
for data in df:
    print(data.shape)
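For example, a common chunked pattern is to fold each piece into a running result so that only one chunk is ever held in memory; a minimal sketch, reusing the placeholder path from above:

import pandas as pd

total_rows = 0
# Each iteration yields an ordinary DataFrame of at most 10000 rows.
for chunk in pd.read_csv("path/test.csv", chunksize=10000):
    total_rows += len(chunk)  # fold each chunk into a running aggregate
print("rows:", total_rows)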
Upvotes: 0
Reputation: 876
If your CSV file is larger than your RAM, you can use a Dask DataFrame (see the Dask official documentation or the Dask Wikipedia page). With a Dask DataFrame you can do data analysis even on a dataset that does not fit in memory, as in the sketch below.
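A minimal sketch of that approach, assuming the same space-delimited eggs.csv from the question:

import dask.dataframe as dd

# dask.dataframe.read_csv splits the file into partitions lazily instead of
# loading it all at once; work happens only when a result is computed.
df = dd.read_csv('eggs.csv', sep=' ', quotechar='|', header=None)

print(len(df))  # row count across all partitions; triggers one full pass over the file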
Upvotes: 0
Reputation: 168596
Yes, there is a way to read the entire file at once: the third positional argument to open is the buffer size, and making it larger than the file means the data is pulled from disk in one large read rather than line by line:

with open('eggs.csv', 'rb', 5000000000) as ...:
    ...

Reference: https://docs.python.org/2/library/functions.html#open
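Filled in with the reader loop from the question (Python 2 style, as in the question; 5000000000 just stands for "larger than the file"), a full sketch would be:

import csv

# A buffer size larger than the file makes open() read it from disk in one go.
with open('eggs.csv', 'rb', 5000000000) as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        process_row(row)  # the asker's own processing function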
Upvotes: 1
Reputation: 78536
"my machine has sufficient ram to hold that in memory."

Well then, call list on the iterator:

spamreader = list(csv.reader(csvfile, delimiter=' ', quotechar='|'))
Upvotes: 1
Reputation: 1980
import pandas as pd

# Reads the whole file into a DataFrame in one call.
# (DataFrame.from_csv is deprecated in newer pandas; pd.read_csv is the current equivalent.)
df = pd.DataFrame.from_csv('filename.csv')

This will read it in as a pandas DataFrame, so you can do all sorts of fun things with it.
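To then loop over the rows in memory, one sketch using read_csv (process_row stands in for the asker's own function, and sep=' ', quotechar='|', header=None mirror the question's space-delimited, headerless file):

import pandas as pd

# Read the whole file into memory in one call, then iterate row by row.
df = pd.read_csv('filename.csv', sep=' ', quotechar='|', header=None)
for row in df.itertuples(index=False):
    process_row(row)  # the asker's own processing function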
Upvotes: 3