remus2232

Reputation: 87

Reading items from a csv and updating the same items in another csv

I'm working on a method that reads data from input.csv and updates the stock column in output.csv based on the product's id.

These are the steps I'm working on right now:

1. Read product info from input.csv into input_data = [], which gives a list of OrderedDicts (see the sketch after this list).

input_data currently looks like this:

[OrderedDict([('id', '1'), ('name', 'a'), ('stock', '33')]),
 OrderedDict([('id', '2'), ('name', 'b'), ('stock', '66')]),
 OrderedDict([('id', '3'), ('name', 'c'), ('stock', '99')])]

2. Read current product info from output.csv into output_data = [], which has the same schema as input_data

3. Iterate through input_data and update the stock column in output_data based on stock info in input_data. What's the best way to do this?

-> An important note: there might be some IDs in input_data which do not exist in output_data. I would like to update the stock for IDs common to input_data and output_data, while the "new" IDs would most likely be written to a separate csv.
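For reference, here is a minimal sketch of step 1 with csv.DictReader (on Python 3.7 and earlier it yields OrderedDicts, matching the output shown above; on 3.8+ it yields plain dicts):

import csv

# Step 1: read input.csv into a list of row dicts.
# csv.DictReader yields OrderedDicts on Python <= 3.7 and plain dicts on 3.8+.
with open('input.csv', newline='') as f:
    input_data = list(csv.DictReader(f))

# input_data[0] -> {'id': '1', 'name': 'a', 'stock': '33'}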

I was thinking of something like (this isn't real code):

for p in input_data:
    # check if p['id'] exists in the list of output_data IDs
    # (I might have to build a list of output_data IDs first to check against input_data IDs)
    # if p['id'] exists in output_data, write the stock to the corresponding product in output_data
    # else, append p to another csv

I know this looks pretty messy; what I'm asking for is a logical way to approach this task without wasting too much compute time. The files in question will probably be around 100,000 rows long, so performance and speed will be an issue.
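For concreteness, a naive, runnable version of the pseudocode above might look like this (still list-based, so the lookup inside the loop is O(n); new_rows is just a placeholder name):

# naive list-based version of the idea above; assumes input_data and
# output_data are the lists of row dicts described in steps 1 and 2
output_ids = [row['id'] for row in output_data]

new_rows = []  # products whose id is not in output_data
for p in input_data:
    if p['id'] in output_ids:
        # find the matching row in output_data and copy the stock over
        for row in output_data:
            if row['id'] == p['id']:
                row['stock'] = p['stock']
                break
    else:
        new_rows.append(p)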

If my data in input_data and output_data are lists of OrderedDicts, what is the best way to look up an id from input_data and write the stock to the product with the exact same id in output_data?

Upvotes: 2

Views: 43

Answers (1)

Mihai

Reputation: 2155

While Python might not be your best option here, I wouldn't use lists of OrderedDicts for this task. Changing something within output_data would require an O(n) lookup, which turns your whole script into O(n**2). I would load the two files into dicts keyed by id (or OrderedDicts if you care about order), which reduces the complexity of the whole thing to O(n), like this:

input_data = {
    '1': ['a', '33'],
    '2': ['b', '66'],
    '3': ['c', '99']
}
output_data = {
    '1': ['a', '31'],
    '3': ['c', '95']
}

# iterate through all keys in input_data and update output_data
# if a key does not exist in output_data, create it in a different dict
new_data = {}
for key in input_data:
    if key not in output_data:
        new_data[key] = input_data[key]
        # for optimisation's sake you could append data into the new file here
        # and not save into a new dict
    else:
        output_data[key][1] = input_data[key][1]
        # for optimisation's sake you could append data into a new output file here
        # and rename/move the new output file into the old output file after the script finishes
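Fleshing this out, a minimal end-to-end sketch with the csv module might look as follows. It keeps the whole row dict as the value instead of a [name, stock] list, and writes the unmatched ids to 'new_products.csv', which is just a placeholder name:

import csv

# build a dict keyed by id for output.csv so each lookup is O(1);
# here the value is the full row dict rather than a [name, stock] list
with open('output.csv', newline='') as f:
    reader = csv.DictReader(f)
    fieldnames = reader.fieldnames  # ['id', 'name', 'stock']
    output_by_id = {row['id']: row for row in reader}

new_rows = []
with open('input.csv', newline='') as f:
    for row in csv.DictReader(f):
        if row['id'] in output_by_id:
            # id already in output.csv: update only the stock column
            output_by_id[row['id']]['stock'] = row['stock']
        else:
            # id only in input.csv: collect it for a separate file
            new_rows.append(row)

# write the updated rows back to output.csv
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(output_by_id.values())

# write the "new" ids to a separate file (placeholder name)
with open('new_products.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(new_rows)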

Upvotes: 2
