Reputation: 1311
I'm trying to import a large (approximately 4 GB) CSV dataset into Python using the pandas
library. The dataset cannot fit into memory all at once, so I read the CSV in chunks of size 10000.
After this I want to concatenate all the chunks into a single DataFrame in order to perform some calculations, but I run out of memory (I use a desktop with 16 GB of RAM).
My code so far:
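import pandas as pd

# Read the CSV lazily; chunks is an iterator of DataFrames
chunks = pd.read_csv("path_to_csv", iterator=True, chunksize=1000)
# Attempt 1: concatenate the chunks
df = pd.concat([chunk for chunk in chunks])
# Attempt 2 (tried separately, on a fresh iterator)
df = pd.concat(chunks, ignore_index=True)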
I searched many threads on StackOverflow and all of them suggest one of these two solutions. Is there a way to overcome this? I can't believe I can't handle a 4 GB dataset with 16 GB of RAM!
UPDATE: I still haven't found a way to import the CSV file directly. I bypassed the problem by importing the data into PostgreSQL and then querying the database.
Upvotes: 1
Views: 2942
Reputation: 3061
I once dealt with this kind of situation using a generator in Python. I hope this will be helpful:
def read_big_file_in_chunks(file_object, chunk_size=1024):
    """Lazily read a large file in pieces of chunk_size characters."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# process_data is a placeholder for whatever you do with each chunk
with open('very_very_big_file.log') as f:
    for chunk in read_big_file_in_chunks(f):
        process_data(chunk)
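The same idea carries over to pandas: rather than concatenating every chunk back into one DataFrame, you can reduce each chunk as it arrives and keep only the running result. Below is a minimal sketch, assuming the calculation can be done incrementally (here, the mean of a single column). The function name column_mean_in_chunks, the column names and the in-memory sample data are made up for illustration; for the real file you would pass the CSV path instead.

```python
import io

import pandas as pd


def column_mean_in_chunks(csv_source, column, chunksize=10000):
    """Compute the mean of one column without loading the whole CSV at once."""
    total = 0.0
    count = 0
    # chunksize makes read_csv return an iterator of DataFrames;
    # usecols restricts parsing to the single column that is needed.
    for chunk in pd.read_csv(csv_source, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


# Tiny in-memory stand-in for the real file (hypothetical data)
sample = io.StringIO("a,b\n1,10\n2,20\n3,30\n4,40\n")
mean_b = column_mean_in_chunks(sample, "b", chunksize=2)
print(mean_b)
```

Only one chunk is held in memory at a time, so peak usage depends on chunksize rather than on the file size. This only helps when the calculation can be expressed as a per-chunk update; if you really need the full DataFrame, this sketch does not apply.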
Upvotes: 1