Reputation: 41
I have 3 files, each in excess of 10GB, that I need to split into 6 smaller files. I would normally use something like R to load the files and partition them into smaller chunks, but the size of the files prevents them from being read into R - even with 20GB of RAM.
I'm stuck on how to proceed next and would greatly appreciate any tips.
Upvotes: 3
Views: 3781
Reputation: 2072
In Python, using generators/iterators, you don't need to load all the data into memory.
Just read it line by line.
The csv library gives you reader and writer classes that will do the job.
To split your file you can write something like this:
import csv
# your input file (10GB)
in_csvfile = open('source.csv', "r")
# reader that will read the file for you line by line
reader = csv.DictReader(in_csvfile)
# number of current line read
num = 0
# number of output file
output_file_num = 1
# your output file
out_csvfile = open('out_{}.csv'.format(output_file_num), "w")
# the writer is constructed inside the read loop,
# because the csv header has to be available
# to construct the DictWriter object
writer = None
for row in reader:
    num += 1
    # Here you have your data line in the row variable
    # If the writer doesn't exist yet, create one
    if writer is None:
        writer = csv.DictWriter(
            out_csvfile,
            fieldnames=row.keys(),
            delimiter=",", quotechar='"', escapechar='"',
            lineterminator='\n', quoting=csv.QUOTE_NONNUMERIC
        )
        # write the header row into this output file
        writer.writeheader()
    # Write the row through the writer (into out_csvfile, remember?)
    writer.writerow(row)
    # Once 10000 rows have been written, close the current
    # output file and start a new one
    if num >= 10000:
        output_file_num += 1
        out_csvfile.close()
        writer = None
        # create a new file
        out_csvfile = open('out_{}.csv'.format(output_file_num), "w")
        # reset counter
        num = 0
# Closing the files
in_csvfile.close()
out_csvfile.close()
I haven't tested it, I wrote it off the top of my head, so there may be bugs :)
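Since you want roughly 6 output files rather than fixed 10000-row chunks, one option is to count the rows first and derive the chunk size from that. A minimal sketch (assuming the same hypothetical source.csv as above, with a single header line; it costs one extra pass over the file but uses almost no memory):
import math
PARTS = 6  # desired number of output files
# first pass: count data rows without loading the file into memory
with open('source.csv', 'r') as f:
    total_rows = sum(1 for _ in f) - 1  # subtract the header line
# use this value in place of the hard-coded 10000 in the loop above
rows_per_file = math.ceil(total_rows / PARTS)
print(rows_per_file)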
Upvotes: 2