Reputation: 691
I am working on a script that will write a massive amount of data to a .csv file. To make the data easier to share with interested users, I would like to cap the number of rows per file. For example, the first million records should be written to some_csv_file_1.csv, the second million to some_csv_file_2.csv, and so on until all records have been written.
I have attempted to get the following to work:
import csv

csv_record_counter = 1
csv_file_counter = 1

while csv_record_counter <= 1000000:
    with open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])
        csv_record_counter += 1

while not csv_record_counter <= 1000000:
    csv_record_counter = 1
    csv_file_counter += 1
Problem: As records increase beyond 1000000, the subsequent files are not created. The script continues to add records to the original file.
Upvotes: 4
Views: 7506
Reputation: 111
Try csvfile.flush() after each output_writer.writerow() call:
with open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w') as csvfile:
    output_writer = csv.writer(csvfile, lineterminator="\n")
    output_writer.writerow(['record'])
    csvfile.flush()  # whenever you want
or
csvfile = open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w')
output_writer = csv.writer(csvfile, lineterminator="\n")
output_writer.writerow(['record'])
csvfile.flush()  # whenever you want, and/or
csvfile.close()  # when you're done
The flush() call writes out the buffer, which frees that memory for new work. When you are dealing with a heavy number of lines, the buffer fills up and is not cleared until your currently running code exits. So it is better to manually flush the buffer every time you write something to the file.
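As a sketch of how that might look with periodic flushing rather than a flush after every single row (the records generator here is a made-up stand-in for the real data source):

import csv

records = ([i, 'record'] for i in range(1000000))  # hypothetical stand-in for the real data

with open('some_csv_file_1.csv', 'w', newline='') as csvfile:
    output_writer = csv.writer(csvfile)
    for count, row in enumerate(records, start=1):
        output_writer.writerow(row)
        if count % 10000 == 0:
            csvfile.flush()  # push buffered rows to disk every 10,000 writes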
Upvotes: 1
Reputation: 8673
I think your data transfer could be successful with the class below:
import csv

class Writer:
    def __init__(self, max_row):
        self.max_row = max_row
        self.cur_row = 0
        self.file_number = 0
        self.file_handle = None

    def write_row(self, row):
        # Start a new file when the current one is full (or on the first write).
        if self.cur_row >= self.max_row or self.file_handle is None:
            self.cur_row = 0
            self.file_number += 1
            if self.file_handle:
                self.file_handle.close()
            self.file_handle = open(f'some_csv_file_{self.file_number}.csv', 'w', newline='')
            self.csv_handle = csv.writer(self.file_handle)
        self.csv_handle.writerow(row)
        self.cur_row += 1

writer = Writer(10)  # 1000000 for you
for row in range(55):  # massive amount of data
    output_row = [row + 1, "record1", "record2"]
    writer.write_row(output_row)
In the example, 10 records per file (some_csv_file_1.csv, some_csv_file_2.csv, ...) are generated.
Output: six files, some_csv_file_1.csv through some_csv_file_6.csv; the first five hold 10 rows each and the last holds the remaining 5.
For you:
output_writer = Writer(1000000)
output_writer.write_row(['record'])
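One caveat with the class above: the last file handle is only closed when the interpreter exits. A minimal sketch of a context-manager wrapper around the same Writer class (the ManagedWriter name is my invention, not part of the original answer):

class ManagedWriter(Writer):
    # Hypothetical extension of the Writer class above.
    def close(self):
        if self.file_handle:
            self.file_handle.close()
            self.file_handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

# The final file is closed even if an exception interrupts the writes.
with ManagedWriter(1000000) as output_writer:
    output_writer.write_row(['record'])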
Upvotes: 0
Reputation: 2455
First, indent your second while-loop and remove the "not". Then use a for-loop instead of a while-loop to create your CSV files. Additionally, don't forget to reset your csv_record_counter.
import csv

csv_record_counter = 1
rows = ...  # your total number of rows to process
additional_file = 1 if rows % 1000000 != 0 else 0  # one extra file for a partial last batch

for csv_file in range(1, int(rows / 1000000) + 1 + additional_file):  # number of CSV files to create
    with open('some_csv_file_' + str(csv_file) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])
        csv_record_counter = 1  # reset instead of your "+="
        while csv_record_counter <= 1000000:  # remove your "not"
            csv_record_counter += 1
            output_writer.writerow(["your record"])
Edit: Added additional_file
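For what it's worth, the additional_file arithmetic amounts to a ceiling division, which math.ceil expresses directly (a sketch, not part of the original answer):

import math

rows = 2500000  # hypothetical total
number_of_files = math.ceil(rows / 1000000)  # 3: two full files plus one partial file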
Upvotes: 2
Reputation: 8960
I like to batch my data before exporting it.
import csv

def batch(iterable, n=1):
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx:min(ndx + n, length)]

headers = []   # your headers
products = []  # millions of products go here
batch_size = int(len(products) / 4)  # example
# OR in your case, batch_size = 1000000

for idx, product_batch in enumerate(batch(products, batch_size)):
    with open('products_{}.csv'.format(idx + 1), 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        for product in product_batch:
            writer.writerow(product)
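Note that batch() requires a sequence that supports len() and slicing. If the products arrive as a generator instead, a sketch of an islice-based equivalent (batch_stream is a hypothetical name, not from the answer):

from itertools import islice

def batch_stream(iterable, n):
    # Yields lists of up to n items; works on any iterable, including generators.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk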
Upvotes: 1