ahmed redha
ahmed redha

Reputation: 63

compare two columns (in two files), then print the similar lines and different lines

I have two files that contain two columns for each one, in the first column the lines are often similar.

file1.csv

C(2)—C(1) 1.5183
C(3)—C(2) 1.49
C(3)—C(1) 1.4991
O(4)—C(3) 1.4104
H(10)—O(4) 0.964
C(2)—C(1)—C(3) 59.19
C(3)—C(1)—H(5) 118.4

file2.csv

C(2)—C(1) 1.5052
C(3)—C(2) 1.505
C(3)—C(1) 1.5037
S(4)—C(3) 1.7976
H(10)—S(4) 1.3445
C(2)—C(1)—H(6) 117.68
C(2)—C(1)—C(3) 60.3
C(3)—C(1)—H(5) 116.99

here is a python script "using itertools" which compares the first colone in file1.csv and file2.csv and then prints the lines similar.

import itertools

files = ['file1.csv', 'file2.csv']
d = {}

for fi, f in enumerate(files):
    fh = open(f)
    for line in fh:
        sl = line.split()
        name = sl[0]
        val = float(sl[1])
        if name not in d:
            d[name] = {}
        if fi not in d[name]:
            d[name][fi] = []
        d[name][fi].append(val)
    fh.close()

for name, vals in d.items():
    if len(vals) == len(files):
        for var in itertools.product(*vals.values()):
            if max(var) - min(var) <= 20:
                out = '{}\t{}'.format(name, "\t".join(map(str, var)))
                print(out)
                break

output.csv

C(2)-C(1)        1.5183   1.5052
C(3)-C(2)        1.49     1.505
C(3)-C(1)        1.4991   1.5037
C(2)-C(1)-C(3)   59.19    60.3
C(3)-C(1)-H(5)   118.4    116.99

but I do not find an idea to print also the different lines.

my desired output:

similar_lines
C(2)-C(1)           1.5183    1.5052
C(3)-C(2)           1.49      1.505
C(3)-C(1)           1.4991    1.5037
C(2)-C(1)-C(3)      59.19     60.3
C(3)-C(1)-H(5)      118.4     116.99

different_lines
O(4)-C(3)           1.4104      non
H(10)-O(4)          0.964       non
S(4)-C(3)            non       1.7976
H(10)-S(4)           non       1.3445
C(2)-C(1)-H(6)       non       117.68

Upvotes: 2

Views: 1828

Answers (1)

Ajax1234
Ajax1234

Reputation: 71471

You can use itertools.groupby:

import itertools, csv
file1 = [i+[True] for i in list(csv.reader(open('filename1.csv')))]
file2 = [i+[False] for i in list(csv.reader(open('filename2.csv')))]
new_data = [[a, list(b)] for a, b in itertools.groupby(sorted(file1+file2, key=lambda x:x[0]), key=lambda x:x[0])]
similar = ['{} {}'.format(a, '     '.join(h for _, h, flag in b)) for a, b in new_data if len(b) > 1]
different = ['{} {}'.format(a, 'non  {}'.format(b[0][1]) if not b[0][-1] else '{}     non'.format(b[0][1])) for a, b in new_data if len(b) == 1]
last_output = 'similar_lines\n{}\n\ndifferent_lines\n{}'.format('\n'.join(similar), '\n'.join(different))

Output:

similar_lines
C (2)-C(1) 1.5183     1.5052
C (2)-C(1)-C(3) 59.19     60.3
C (3)-C(1) 1.4991     1.5037
C (3)-C(1)-H(5) 118.4     116.99
C (3)-C(2) 1.49     1.505

different_lines
C (2)-C(1)-H(6) non  117.68
H (10)-O(4) 0.964     non
H (10)-S(4) non  1.3445
O (4)-C(3) 1.4104     non
S (4)-C(3) non  1.7976

Upvotes: 1

Related Questions