clcordell

Reputation: 57

Comparing data in two CSV files

I have two CSV files that each contain all the products in the database (approx. 130,000 rows per file). Currently the files are compared using Excel formulas, which is a slow process.

I have written a script in Python which works well with small sample data, but it isn't practical at that scale.

CSV layout is:

ID, Product Title, Cost, Price1, Price2, Price3, Status

import csv

data_old = []
data_new = []

# The with-blocks close the files automatically, so no explicit close() is needed
with open(file_path_old) as f1:
    data = csv.reader(f1)
    next(data)  # skip the header row
    for row in data:
        data_old.append(row)

with open(file_path_new) as f2:
    data = csv.reader(f2)
    next(data)  # skip the header row
    for row in data:
        data_new.append(row)

for d1 in data_new:
    for d2 in data_old:
        if d2[0] == d1[0]:
            # If match check rest of data in the same row
            if d2[1] != d1[1]:
                ...
            if d2[2] != d1[2]:
                ...

The issue with the above is that, being a nested for loop, it scans all 130,000 rows of the old data once for every row of the new data, i.e. O(n^2) comparisons. (Slow is an understatement.)

What I'm trying to achieve is a list of all the products that have had a change in the title, cost, any of the three prices, or status, together with a boolean flag per field showing what has changed from the previous week's data.

Desired Output CSV Format:

ID, Old Title, New Title, Changed, Old Cost, New Cost, Changed....

123, ABC, ABC, False, £12, £13, True....

SOLUTION:

import pandas as pd

# Read CSVs
old = pd.read_csv(old_file, sep=",")
new = pd.read_csv(new_file, sep=",")

# Join the data into a single table, keyed on the ID column (PARTNO)
df_join = pd.concat([old.set_index('PARTNO'), new.set_index('PARTNO')],
                    axis='columns', keys=['Old', 'New'])

# Swap the column levels so each field's Old and New values sit side by side
df_swap = df_join.swaplevel(axis='columns')[old.columns[1:]]

# Output to CSV
df_swap.to_csv(output_file)
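
The join above places the old and new values side by side, but it does not yet produce the boolean "Changed" flags from the desired output. A minimal sketch of one way to derive them from the same joined frame (the changed_file name and the NaN behaviour noted in the comments are assumptions, not part of the original solution):

import pandas as pd

old = pd.read_csv(old_file, sep=",").set_index('PARTNO')
new = pd.read_csv(new_file, sep=",").set_index('PARTNO')
df_join = pd.concat([old, new], axis='columns', keys=['Old', 'New'])

# Add a boolean flag per field; products present in only one file have
# NaN on the missing side, so they are reported as changed as well
for col in old.columns:
    df_join[('Changed', col)] = df_join[('Old', col)] != df_join[('New', col)]

# Keep only the rows where at least one field changed
changed_rows = df_join[df_join['Changed'].any(axis='columns')]
changed_rows.to_csv(changed_file)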

Upvotes: 2

Views: 277

Answers (2)

bigh_29

Reputation: 2633

Do you care about new and removed products? If not, then you can get O(n) performance by using a dictionary. Pick one CSV file and shove it into a dictionary keyed by id. Use lookups into the dictionary to find products that changed. Note that I simplified your data down to one column for brevity.

data_old = [
    (1, 'alpha'),
    (2, 'bravo'),
    (3, 'delta'),
    (5, 'echo')
]

data_new = [
    (1, 'alpha'),
    (2, 'zulu'),
    (4, 'foxtrot'),
    (6, 'mike'),
    (7, 'lima'),
]

changed_products = []
new_product_map = {id: product for (id, product) in data_new}
for id, old_product in data_old:
    if id in new_product_map and new_product_map[id] != old_product:
        changed_products.append(id)

print('Changed products: ', changed_products)

You can shorten this even more with a list comprehension:

new_product_map = {id: product for (id, product) in data_new}
changed_products = [id for (id, old_product) in data_old if id in new_product_map and new_product_map[id] != old_product]
print('Changed products: ', changed_products)

The diff algorithm below also tracks insertions and deletions. You can use it if your CSV files are sorted by id; if they have no sensible order, you can sort the data in O(n log n) time after loading it (see the sketch after the code) and then proceed with the diff. Either way, this will be faster than the O(n^2) loops in your original post:

data_old = ...  # same setup as before
data_new = ...  # ditto

old_index = 0
new_index = 0

new_products = []
deleted_products = []
changed_products = []

while old_index < len(data_old) and new_index < len(data_new):
    (old_id, old_product) = data_old[old_index]
    (new_id, new_product) = data_new[new_index]

    if old_id < new_id:
        print('Product removed : %d' % old_id)
        deleted_products.append(old_id)
        old_index += 1
    elif new_id < old_id:
        print('Product added : %d' % new_id)
        new_products.append(new_id)
        new_index += 1
    else:
        if old_product != new_product:
            print('Product %d changed from %s to %s' % (old_id, old_product, new_product))
            changed_products.append(old_id)
        else:
            print('Product %d did not change' % old_id)

        old_index += 1
        new_index += 1

if old_index != len(data_old):
    num_deleted = len(data_old) - old_index
    print('The last %d old items were deleted' % num_deleted)
    deleted_products += [id for (id, _) in data_old[old_index:]]
elif new_index != len(data_new):
    num_added = len(data_new) - new_index
    print('The last %d items in the new data were completely new' % num_added)
    new_products += [id for (id, _) in data_new[new_index:]]

print('New products: ', new_products)
print('Changed products: ', changed_products)
print('Deleted products: ', deleted_products)
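
As mentioned above, if the files have no sensible order, sort both lists by id before running the diff. A minimal sketch (tuples compare by their first element, so the explicit key mainly documents intent):

# O(n log n) sort by id before the O(n) diff
data_old.sort(key=lambda pair: pair[0])
data_new.sort(key=lambda pair: pair[0])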

PS: The suggestion to use pandas is a great one. Use it if possible.

Upvotes: 0

Tommy-Xavier Robillard

Reputation: 339

Just use pandas

import pandas as pd
old = pd.read_csv(file_path_old, sep=',')
new = pd.read_csv(file_path_new, sep=',')

Then you can do whatever you need (just read the docs). For example, to compare the titles:

old['Title'] == new['Title'] gives you an array of booleans, one for every row in your file.
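
Note that comparing columns directly like this assumes both files contain the same rows in the same order. A minimal sketch that first aligns the two frames on the ID column, so row order no longer matters (the 'ID' and 'Product Title' column names follow the layout in the question):

import pandas as pd

old = pd.read_csv(file_path_old).set_index('ID')
new = pd.read_csv(file_path_new).set_index('ID')

# Compare only the IDs present in both files
common = old.index.intersection(new.index)
title_changed = old.loc[common, 'Product Title'] != new.loc[common, 'Product Title']

# IDs whose title changed between the two weeks
print(title_changed[title_changed].index.tolist())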

Upvotes: 1
