clcordell

Reputation: 57

Comparing data in two CSV files

I have two CSV files that each contain all the products in the database (approx. 130,000 rows per file). Currently the files are compared using Excel formulas, which is a slow process.

I have written a script in Python which works well with small sample data, but it isn't practical at that scale.

CSV layout is:

ID, Product Title, Cost, Price1, Price2, Price3, Status

import csv

data_old = []
data_new = []

# The with-blocks close the files automatically, so no explicit close() is needed
with open(file_path_old) as f1:
    data = csv.reader(f1)
    next(data)  # skip the header row
    for row in data:
        data_old.append(row)

with open(file_path_new) as f2:
    data = csv.reader(f2)
    next(data)  # skip the header row
    for row in data:
        data_new.append(row)

for d1 in data_new:
    for d2 in data_old:
        if d2[0] == d1[0]:
            # If match check rest of data in the same row
            if d2[1] != d1[1]:
                ...
            if d2[2] != d1[2]:
                ...

The issue with the above is that, being a nested for loop, it scans all 130,000 rows of the old data once for every row of the new data, i.e. O(n^2) comparisons. (Slow is an understatement.)

What I'm trying to achieve is a list of all the products that have had a change in the title, cost, any of the three prices, or status, together with a boolean flag per field showing what has changed from the previous week's data.

Desired Output CSV Format:

ID, Old Title, New Title, Changed, Old Cost, New Cost, Changed....

123, ABC, ABC, False, £12, £13, True....

SOLUTION:

import pandas as pd

# Read CSVs
old = pd.read_csv(old_file, sep=",")
new = pd.read_csv(new_file, sep=",")

# Join the data into a single table, keyed on the ID column (PARTNO)
df_join = pd.concat([old.set_index('PARTNO'), new.set_index('PARTNO')],
                    axis='columns', keys=['Old', 'New'])

# Swap the column levels so each field's Old and New values sit side by side
df_swap = df_join.swaplevel(axis='columns')[old.columns[1:]]

# Output to CSV
df_swap.to_csv(output_file)
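
The join above places the old and new values side by side, but it does not yet produce the boolean "Changed" flags from the desired output. A minimal sketch of one way to derive them from the same joined frame (the changed_file name and the NaN behaviour noted in the comments are assumptions, not part of the original solution):

import pandas as pd

old = pd.read_csv(old_file, sep=",").set_index('PARTNO')
new = pd.read_csv(new_file, sep=",").set_index('PARTNO')
df_join = pd.concat([old, new], axis='columns', keys=['Old', 'New'])

# Add a boolean flag per field; products present in only one file have
# NaN on the missing side, so they are reported as changed as well
for col in old.columns:
    df_join[('Changed', col)] = df_join[('Old', col)] != df_join[('New', col)]

# Keep only the rows where at least one field changed
changed_rows = df_join[df_join['Changed'].any(axis='columns')]
changed_rows.to_csv(changed_file)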

Upvotes: 2

Views: 277

Answers (2)

bigh_29

Reputation: 2633

Do you care about new and removed products? If not, then you can get O(n) performance by using a dictionary. Pick one CSV file and shove it into a dictionary keyed by id. Use lookups into the dictionary to find products that changed. Note that I simplified your data down to one column for brevity.

data_old = [
    (1, 'alpha'),
    (2, 'bravo'),
    (3, 'delta'),
    (5, 'echo')
]

data_new = [
    (1, 'alpha'),
    (2, 'zulu'),
    (4, 'foxtrot'),
    (6, 'mike'),
    (7, 'lima'),
]

changed_products = []
new_product_map = {id: product for (id, product) in data_new}
for id, old_product in data_old:
    if id in new_product_map and new_product_map[id] != old_product:
        changed_products.append(id)

print('Changed products: ', changed_products)

You can shorten this even more with a list comprehension:

new_product_map = {id: product for (id, product) in data_new}
changed_products = [id for (id, old_product) in data_old if id in new_product_map and new_product_map[id] != old_product]
print('Changed products: ', changed_products)

The diff algorithm below also tracks insertions and deletions. You can use it if your CSV files are sorted by id; if they have no sensible order, you can sort the data in O(n log n) time after loading it (see the sketch after the code) and then proceed with the diff. Either way, this will be faster than the O(n^2) loops in your original post:

data_old = ...  # same setup as before
data_new = ...  # ditto

old_index = 0
new_index = 0

new_products = []
deleted_products = []
changed_products = []

while old_index < len(data_old) and new_index < len(data_new):
    (old_id, old_product) = data_old[old_index]
    (new_id, new_product) = data_new[new_index]

    if old_id < new_id:
        print('Product removed : %d' % old_id)
        deleted_products.append(old_id)
        old_index += 1
    elif new_id < old_id:
        print('Product added : %d' % new_id)
        new_products.append(new_id)
        new_index += 1
    else:
        if old_product != new_product:
            print('Product %d changed from %s to %s' % (old_id, old_product, new_product))
            changed_products.append(old_id)
        else:
            print('Product %d did not change' % old_id)

        old_index += 1
        new_index += 1

if old_index != len(data_old):
    num_deleted = len(data_old) - old_index
    print('The last %d old items were deleted' % num_deleted)
    deleted_products += [id for (id, _) in data_old[old_index:]]
elif new_index != len(data_new):
    num_added = len(data_new) - new_index
    print('The last %d items in the new data were completely new' % num_added)
    new_products += [id for (id, _) in data_new[new_index:]]

print('New products: ', new_products)
print('Changed products: ', changed_products)
print('Deleted products: ', deleted_products)
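
As mentioned above, if the files have no sensible order, sort both lists by id before running the diff. A minimal sketch (tuples compare by their first element, so the explicit key mainly documents intent):

# O(n log n) sort by id before the O(n) diff
data_old.sort(key=lambda pair: pair[0])
data_new.sort(key=lambda pair: pair[0])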

PS: The suggestion to use pandas is a great one. Use it if possible.

Upvotes: 0

Tommy-Xavier Robillard

Reputation: 339

Just use pandas

import pandas as pd
old = pd.read_csv(file_path_old, sep=',')
new = pd.read_csv(file_path_new, sep=',')

Then you can do whatever you need (just read the docs). For example, to compare the titles:

old['Title'] == new['Title'] gives you an array of booleans, one for every row in your file.
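
Note that comparing columns directly like this assumes both files contain the same rows in the same order. A minimal sketch that first aligns the two frames on the ID column, so row order no longer matters (the 'ID' and 'Product Title' column names follow the layout in the question):

import pandas as pd

old = pd.read_csv(file_path_old).set_index('ID')
new = pd.read_csv(file_path_new).set_index('ID')

# Compare only the IDs present in both files
common = old.index.intersection(new.index)
title_changed = old.loc[common, 'Product Title'] != new.loc[common, 'Product Title']

# IDs whose title changed between the two weeks
print(title_changed[title_changed].index.tolist())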

Upvotes: 1
