Ruben Sotolongo

Reputation: 31

Pandas appending csv that is too large to load and might need new columns

I currently have a CSV file that is far too large to load on my machine, and I would like to append data to it without loading the whole thing. The problem is that I might need to dynamically add new columns. I don't need to change the order of the existing columns, just add new ones.

Ex. Huge.csv with columns

A B C
1 1 1

I need to add data to A, B, and C, as well as add the column D if it isn't already there, leaving the CSV as

A B C D
1 1 1
2 2 2 2

If a CSV isn't suited for this task I could use a different file format, but ultimately it will need to be compatible with Tableau. I think the overlap in file types between Tableau and pandas is CSV, JSON, and SQL, though I don't know how to create or handle an SQL file.

Upvotes: 3

Views: 852

Answers (1)

Martin Evans

Reputation: 46759

Adding a new column involves reading, updating, and writing every row in the CSV file. However, this can be done one row at a time (without using Pandas), so it works regardless of how huge the file is.

  1. Open both the input CSV and output CSV files.
  2. Skip over the header and write the correct header to the output file.
  3. For each row, determine if the D column is missing and if needed append a blank entry to the row.
  4. Write the updated row to the output file.
  5. Finally, append any new data to the end of the file.

For example:

import csv

# Create new CSV file with missing D column (only needed once)

with open('huge.csv', newline='') as f_input, open('huge updated.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)
    
    header = next(csv_input)    # skip existing header
    csv_output.writerow(['A', 'B', 'C', 'D'])
    
    for row in csv_input:
        if len(row) == 3:    # Is D column missing?
            row.append('')
            
        csv_output.writerow(row)

    # Append new data

    data = [[2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]
    csv_output.writerows(data)

I would assume that the updating phase would only need to be done once, but the appending phase multiple times. If this is the case, split the script into two and run the update phase just once. Rows can be appended to a file without needing to read the whole file in as follows:

import csv

with open('huge updated.csv', 'a', newline='') as f_output:
    data = [[2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]
    csv.writer(f_output).writerows(data)
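Since the question mentions SQL as an acceptable Tableau-compatible format, it's worth noting that SQLite avoids the rewrite entirely: `ALTER TABLE ... ADD COLUMN` is a metadata-only change, so adding column D never touches the existing rows. A minimal sketch (the file name `huge.db` and table name `data` are just placeholders for illustration):

```python
import sqlite3

conn = sqlite3.connect('huge.db')
cur = conn.cursor()

# Create the table once with the original columns.
cur.execute('CREATE TABLE IF NOT EXISTS data (A, B, C)')

# Adding a column later does not rewrite existing rows in SQLite;
# check the current columns first so the script can be re-run safely.
existing = [row[1] for row in cur.execute('PRAGMA table_info(data)')]
if 'D' not in existing:
    cur.execute('ALTER TABLE data ADD COLUMN D')

# Append new rows without reading the old ones.
new_rows = [(2, 2, 2, 2), (3, 3, 3, 3), (4, 4, 4, 4)]
cur.executemany('INSERT INTO data (A, B, C, D) VALUES (?, ?, ?, ?)', new_rows)
conn.commit()
conn.close()
```

Existing rows simply report `NULL` for D, which Tableau reads as a null value when connecting to the SQLite file.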

Upvotes: 2
