Reputation: 41
I have 3 files, each in excess of 10GB, that I need to split into 6 smaller files. I would normally use something like R to load the files and partition them into smaller chunks, but the size of the files prevents them from being read into R - even with 20GB of RAM.
I'm stuck on how to proceed next and would greatly appreciate any tips.
Upvotes: 3
Views: 3781
Reputation: 2072
In Python, using generators/iterators, you don't need to load all the data into memory.
Just read it line by line.
The csv library gives you reader and writer classes that will do the job.
To split your file you can write something like this:
import csv
# your input file (10GB)
in_csvfile = open('source.csv', "r")
# reader that will read the file for you line by line
reader = csv.DictReader(in_csvfile)
# number of current line read
num = 0
# number of output file
output_file_num = 1
# your output file
out_csvfile = open('out_{}.csv'.format(output_file_num), "w")
# the writer is constructed inside the read loop,
# because the csv header has to be available
# to construct the DictWriter object
writer = None
for row in reader:
    num += 1
    # Here you have your data line in the row variable
    # If the writer doesn't exist yet, create one
    if writer is None:
        writer = csv.DictWriter(
            out_csvfile,
            fieldnames=row.keys(),
            delimiter=",", quotechar='"', escapechar='"',
            lineterminator='\n', quoting=csv.QUOTE_NONNUMERIC
        )
        # write the header row into this output file
        writer.writeheader()
    # Write the row through the writer (into out_csvfile, remember?)
    writer.writerow(row)
    # Once 10000 rows have been written, close the current
    # output file and start a new one
    if num >= 10000:
        output_file_num += 1
        out_csvfile.close()
        writer = None
        # create a new file
        out_csvfile = open('out_{}.csv'.format(output_file_num), "w")
        # reset counter
        num = 0
# Closing the files
in_csvfile.close()
out_csvfile.close()
I haven't tested it, I wrote it off the top of my head, so there may be bugs :)
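Since you want roughly 6 output files rather than fixed 10000-row chunks, one option is to count the rows first and derive the chunk size from that. A minimal sketch (assuming the same hypothetical source.csv as above, with a single header line; it costs one extra pass over the file but uses almost no memory):
import math
PARTS = 6  # desired number of output files
# first pass: count data rows without loading the file into memory
with open('source.csv', 'r') as f:
    total_rows = sum(1 for _ in f) - 1  # subtract the header line
# use this value in place of the hard-coded 10000 in the loop above
rows_per_file = math.ceil(total_rows / PARTS)
print(rows_per_file)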
Upvotes: 2