Reputation: 31
I currently have a CSV file that is far too large to load on my machine, and I would like to append data to it without loading the whole thing. The problem is that I might need to dynamically add new columns. I don't need to change the order of the existing columns, only add new ones.
Ex. Huge.csv
with columns
A B C
1 1 1
I need to add data to A B C
as well as add the column D
if it isn't already there. Leaving the csv
A B C D
1 1 1
2 2 2 2
If a CSV isn't suited for this task I could use a different file format, but ultimately it will need to be compatible with Tableau. I think the overlap in file types between Tableau and pandas is CSV, JSON, and SQL, though I don't know how to create or handle an SQL file.
Upvotes: 3
Views: 852
Reputation: 46759
Adding a new column involves reading, updating and writing every row in the CSV file. This can though be done a row at a time (without using Pandas), so it will work regardless of how huge the file is. Read each row, test whether the
D
column is missing, and if needed append a blank entry to the row. For example:
import csv

# Create new CSV file with missing D column (only needed once)
with open('huge.csv', newline='') as f_input, open('huge updated.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    header = next(csv_input)    # skip existing header
    csv_output.writerow(['A', 'B', 'C', 'D'])

    for row in csv_input:
        if len(row) == 3:       # Is D column missing?
            row.append('')
        csv_output.writerow(row)

    # Append new data
    data = [[2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]
    csv_output.writerows(data)
I would assume that the updating phase would only need to be done once, but the appending phase multiple times. If this is the case, split the script into two and run the update phase just once. Rows can be appended to a file without needing to read the whole file in as follows:
import csv

with open('huge updated.csv', 'a', newline='') as f_output:
    data = [[2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]
    csv.writer(f_output).writerows(data)
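If it is not known whether the update pass has already been run, the header line can be inspected on its own, again without reading the rest of the file. A minimal sketch, assuming the filename and column name used above (the helper name is just an example):

```python
import csv

def header_has_column(filename, column):
    # Read only the first row of the CSV and check whether
    # the given column name appears in it.
    with open(filename, newline='') as f:
        header = next(csv.reader(f), [])
    return column in header

# e.g. run the one-off update pass only when needed:
# if not header_has_column('huge updated.csv', 'D'):
#     ...run the column-adding script above...
```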
Upvotes: 2