Reputation: 95
I'm writing a script that loops over each 'data_i.csv' file in a folder, extracts a list of variables from it, and appends that list as a new row to a single 'output.csv' file.
My objective is to define the headers of the file once and then keep appending data to 'output.csv' so it functions as a running log for a standard measurement. The first time I run the script it should add all the files in the folder; the next time I run it, I want it to append only the files that have been added since. I thought one way of doing this would be to check for duplicates, but the code I have found for that so far only detects consecutive duplicates.
Do you have suggestions?
Here's what I have so far:
import csv, os

# Find csv files
for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('.csv'):
        continue

    # Read in csv file and choose certain cells
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace=True)
    csvLines = list(csvData)
    cellID = csvLines[4][3]
    # Read in several variables...
    csvRows = [cellID]
    csvFileObj.close()

    resultFile = open("Output.csv", 'a')  # open in 'append' mode
    wr = csv.writer(resultFile)
    wr.writerows([csvRows])
    resultFile.close()
This is the final script after mgc's answer:
import csv, os

f = open('Output.csv', 'r+')
merged_files = list(csv.reader(f))  # rows already present in Output.csv

for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('_spm.txt'):
        continue
    if csvFilename in merged_files:
        continue
    csvRows = []
    csvFileObj = open(csvFilename)
    csvData = csv.reader(csvFileObj, delimiter=' ', skipinitialspace=True)
    csvLines = list(csvData)
    waferID = csvLines[4][3]
    temperature = csvLines[21][2]
    csvRows = [waferID, temperature]
    merged_files.append(csvRows)
    csvFileObj.close()

wr = csv.writer(f)
wr.writerows(merged_files)
f.close()
Upvotes: 2
Views: 1660
Reputation: 5443
You can keep track of the name of each file already handled. If this log file doesn't need to be human-readable, you can use pickle. At the start of your script, you can do:
import pickle

try:
    with open('merged_log', 'rb') as f:
        merged_files = pickle.load(f)
except FileNotFoundError:
    merged_files = set()
Then you can add a condition to skip files that have already been processed:
if filename in merged_files: continue
Then, when you are processing a file, you can do:
merged_files.add(filename)
And save the set at the end of your script (so it will be available on the next run):
with open('merged_log', 'wb') as f:
    pickle.dump(merged_files, f)
(There are other options for your problem as well: for example, you can slightly change the name of a file once it has been processed, like changing the extension from .csv to .csv_, or move processed files into a subfolder, etc.)
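For instance, a minimal sketch of the move-to-a-subfolder variant (untested; the 'processed' folder name is just a placeholder, and Output.csv is skipped so it never gets moved):

import os, shutil

processed_dir = 'processed'              # hypothetical subfolder for handled files
os.makedirs(processed_dir, exist_ok=True)

for csvFilename in os.listdir('.'):
    if not csvFilename.endswith('.csv') or csvFilename == 'Output.csv':
        continue
    # ... extract the variables and append the row to Output.csv here ...
    # then mark the file as done by moving it out of the working folder
    shutil.move(csvFilename, os.path.join(processed_dir, csvFilename))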
Also, in the example in your question, I don't think you need to open (and close) your output file on each iteration of your for loop. Open it once before the loop, write what you have to write, then close it once you have left the loop.
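Putting both points together, a rough sketch of the whole script could look like this (untested; the cell index csvLines[4][3] comes from your question, and you may need to adapt the filename filter and the delimiter to your data):

import csv, os, pickle

# load the set of filenames that were already merged (empty on the first run)
try:
    with open('merged_log', 'rb') as log:
        merged_files = pickle.load(log)
except FileNotFoundError:
    merged_files = set()

# open the output file once, before the loop
with open('Output.csv', 'a', newline='') as resultFile:
    wr = csv.writer(resultFile)
    for csvFilename in os.listdir('.'):
        if not csvFilename.endswith('.csv') or csvFilename == 'Output.csv':
            continue
        if csvFilename in merged_files:
            continue                     # already handled on a previous run
        with open(csvFilename) as csvFileObj:
            csvLines = list(csv.reader(csvFileObj, delimiter=' ', skipinitialspace=True))
        cellID = csvLines[4][3]          # cell index taken from the question
        wr.writerow([cellID])
        merged_files.add(csvFilename)

# remember what has been merged for the next run
with open('merged_log', 'wb') as log:
    pickle.dump(merged_files, log)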
Upvotes: 2