Reputation: 691
I am working on a script that will write a massive amount of data to a .csv file. To make the data easier to share with interested users, I would like to cap the number of rows per file. For example, the first million records should be written to some_csv_file_1.csv, the second million to some_csv_file_2.csv, and so on until all records have been written.
I have attempted to get the following to work:
import csv

csv_record_counter = 1
csv_file_counter = 1

while csv_record_counter <= 1000000:
    with open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])
        csv_record_counter += 1

while not csv_record_counter <= 1000000:
    csv_record_counter = 1
    csv_file_counter += 1
Problem: As records increase beyond 1000000, the subsequent files are not created. The script continues to add records to the original file.
Upvotes: 4
Views: 7506
Reputation: 111
Try csvfile.flush() after each output_writer.writerow() call:
with open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w') as csvfile:
    output_writer = csv.writer(csvfile, lineterminator="\n")
    output_writer.writerow(['record'])
    csvfile.flush()  # whenever you want
or
csvfile = open('some_csv_file_' + str(csv_file_counter) + '.csv', 'w')
output_writer = csv.writer(csvfile, lineterminator="\n")
output_writer.writerow(['record'])
csvfile.flush()  # whenever you want, and/or
csvfile.close()  # when you're done
The flush() call writes out the buffer, which frees that memory for new work. When you are dealing with a heavy number of lines, the buffer fills up and is not cleared until your currently running code exits. So it is better to manually flush the buffer every time you write something to the file.
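As a sketch of how that might look with periodic flushing rather than a flush after every single row (the records generator here is a made-up stand-in for the real data source):

import csv

records = ([i, 'record'] for i in range(1000000))  # hypothetical stand-in for the real data

with open('some_csv_file_1.csv', 'w', newline='') as csvfile:
    output_writer = csv.writer(csvfile)
    for count, row in enumerate(records, start=1):
        output_writer.writerow(row)
        if count % 10000 == 0:
            csvfile.flush()  # push buffered rows to disk every 10,000 writes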
Upvotes: 1
Reputation: 8673
I think your data transfer could be successful with the class below:
import csv

class Writer:
    def __init__(self, max_row):
        self.max_row = max_row
        self.cur_row = 0
        self.file_number = 0
        self.file_handle = None

    def write_row(self, row):
        # Start a new file when the current one is full (or on the first write).
        if self.cur_row >= self.max_row or self.file_handle is None:
            self.cur_row = 0
            self.file_number += 1
            if self.file_handle:
                self.file_handle.close()
            self.file_handle = open(f'some_csv_file_{self.file_number}.csv', 'w', newline='')
            self.csv_handle = csv.writer(self.file_handle)
        self.csv_handle.writerow(row)
        self.cur_row += 1

writer = Writer(10)  # 1000000 for you
for row in range(55):  # massive amount of data
    output_row = [row + 1, "record1", "record2"]
    writer.write_row(output_row)
In the example, 10 records per file (some_csv_file_1.csv, some_csv_file_2.csv, ...) are generated.
Output: six files, some_csv_file_1.csv through some_csv_file_6.csv; the first five hold 10 rows each and the last holds the remaining 5.
For you:
output_writer = Writer(1000000)
output_writer.write_row(['record'])
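One caveat with the class above: the last file handle is only closed when the interpreter exits. A minimal sketch of a context-manager wrapper around the same Writer class (the ManagedWriter name is my invention, not part of the original answer):

class ManagedWriter(Writer):
    # Hypothetical extension of the Writer class above.
    def close(self):
        if self.file_handle:
            self.file_handle.close()
            self.file_handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

# The final file is closed even if an exception interrupts the writes.
with ManagedWriter(1000000) as output_writer:
    output_writer.write_row(['record'])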
Upvotes: 0
Reputation: 2455
First, indent your second while-loop and remove the "not". Then use a for-loop instead of a while-loop to create your CSV files. Additionally, don't forget to reset your csv_record_counter.
import csv

csv_record_counter = 1
rows = ...  # your total number of rows to process
additional_file = 1 if rows % 1000000 != 0 else 0  # one extra file for a partial last batch

for csv_file in range(1, int(rows / 1000000) + 1 + additional_file):  # number of CSV files to create
    with open('some_csv_file_' + str(csv_file) + '.csv', 'w') as csvfile:
        output_writer = csv.writer(csvfile, lineterminator="\n")
        output_writer.writerow(['record'])
        csv_record_counter = 1  # reset instead of your "+="
        while csv_record_counter <= 1000000:  # remove your "not"
            csv_record_counter += 1
            output_writer.writerow(["your record"])
Edit: Added additional_file
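For what it's worth, the additional_file arithmetic amounts to a ceiling division, which math.ceil expresses directly (a sketch, not part of the original answer):

import math

rows = 2500000  # hypothetical total
number_of_files = math.ceil(rows / 1000000)  # 3: two full files plus one partial file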
Upvotes: 2
Reputation: 8960
I like to batch my data before exporting it.
import csv

def batch(iterable, n=1):
    length = len(iterable)
    for ndx in range(0, length, n):
        yield iterable[ndx:min(ndx + n, length)]

headers = []   # your headers
products = []  # millions of products go here
batch_size = int(len(products) / 4)  # example
# OR in your case, batch_size = 1000000

for idx, product_batch in enumerate(batch(products, batch_size)):
    with open('products_{}.csv'.format(idx + 1), 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=headers)
        writer.writeheader()
        for product in product_batch:
            writer.writerow(product)
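Note that batch() requires a sequence that supports len() and slicing. If the products arrive as a generator instead, a sketch of an islice-based equivalent (batch_stream is a hypothetical name, not from the answer):

from itertools import islice

def batch_stream(iterable, n):
    # Yields lists of up to n items; works on any iterable, including generators.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk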
Upvotes: 1