user3218088
user3218088

Reputation: 253

Comparing Two CSV in Python

I am formatting CSV in Python to get the desired result but my code doesn't seem to work properly.

I have the first CSV file in format:

2,a
1,a
4,a
5,a
3,a
1,a
3,b
2,b
1,a

The second CSV file in format:

1,a,123
1,a,234
2,a,456
2,b,345
3,a,789
3,b,232
4,a,987

As the first CSV file is not sorted, the second CSV file is sorted in increasing order with respect to first column

I want the output in format:

2,a,456
1,a,123
4,a,987
5,a
3,a,789
1,a,234
3,b,232
2,b,345
1,a

The result is printed in respect to the first CSV, if the first CSV file combination is not found in the second CSV File, for example if 5,a is not in the second CSV file then just 5,a is printed in its respected position. The first CSV file contains many duplicates, while in the second CSV file each row is unique.

Here is my Code

for (num,alpha) in first_csv:
    value_found = True
    for (num1,alpha1,num2) in second_csv:
        if (num == num1 and alpha == alpha1):
            csv_out +=  str(num) + ',' + str(alpha) + ',' + str(number)
            value_found = False
    if value_found:
        count+=1
        if count == 1:
            csv_out += str(num) + ',' + str(alpha)

first_csv and second_csv are tuples I have created after reading the CSV files with the code:

with open('first_csv.csv') as f:
    f.readline()
    first_csv = tuple(csv.reader(f, delimiter=','))


with open('second_csv.csv') as f:
    f.readline()
    second_csv = tuple(csv.reader(f, delimiter=','))

But it is not printing the desired output, where am I doing wrong?

Upvotes: 2

Views: 202

Answers (2)

Janne Karila
Janne Karila

Reputation: 25207

This collects the num2 values from second file into a dictionary of deques. When matches are found, they are removed using deque.popleft to use each match only once, in the same order as they occurred in the second file.

from collections import defaultdict, deque

with open('second_csv.csv') as f:
    next(f) #skip header
    dic = defaultdict(deque)
    for num1,alpha1,num2 in csv.reader(f, delimiter=','):
        dic[num1, alpha1].append(num2)

with open('first_csv.csv') as f, open('out.csv', 'wb') as fout:
    next(f) #skip header
    csv_out = csv.writer(fout)
    for num,alpha in csv.reader(f, delimiter=','):
        try:
            num2 = dic[num,alpha].popleft()
            csv_out.writerow([num,alpha,num2])
        except IndexError:
            csv_out.writerow([num,alpha])
  • a deque retains the order of items that you add by append and remove by popleft.
  • a dict does fast lookup by key which can be a tuple
  • a defaultdict(deque) is a dict that creates an empty deque automatically when you access a missing key, so you can append to it directly.

Upvotes: 3

Steinar Lima
Steinar Lima

Reputation: 7821

This should do the trick. Note that, for each iteration over first_csv, the worst-case scenario is that it would have to iterate to the end of what's left of the data from the second csv file (the rows gets pop-ed if matched).

import csv

with open("second_csv.csv") as in_file:
    reader = csv.reader(in_file)
    lookup = list(reader)

with open("first_csv.csv") as in_file, open('output.csv', 'wb') as out_file:
    reader = csv.reader(in_file)
    writer = csv.writer(out_file)
    for row in reader:
        for i, data in enumerate(lookup):
            if row == data[:2]:
                row = lookup.pop(i)
                break
        writer.writerow(row)

output.csv

2,a,456
1,a,123
4,a,987
5,a
3,a,789
1,a,234
3,b,232
2,b,345
1,a

Upvotes: 2

Related Questions