Reputation: 63
I have two files that contain two columns for each one, in the first column the lines are often similar.
file1.csv
C(2)—C(1) 1.5183
C(3)—C(2) 1.49
C(3)—C(1) 1.4991
O(4)—C(3) 1.4104
H(10)—O(4) 0.964
C(2)—C(1)—C(3) 59.19
C(3)—C(1)—H(5) 118.4
file2.csv
C(2)—C(1) 1.5052
C(3)—C(2) 1.505
C(3)—C(1) 1.5037
S(4)—C(3) 1.7976
H(10)—S(4) 1.3445
C(2)—C(1)—H(6) 117.68
C(2)—C(1)—C(3) 60.3
C(3)—C(1)—H(5) 116.99
here is a python script "using itertools" which compares the first colone in file1.csv and file2.csv and then prints the lines similar.
import itertools
files = ['file1.csv', 'file2.csv']
d = {}
for fi, f in enumerate(files):
fh = open(f)
for line in fh:
sl = line.split()
name = sl[0]
val = float(sl[1])
if name not in d:
d[name] = {}
if fi not in d[name]:
d[name][fi] = []
d[name][fi].append(val)
fh.close()
for name, vals in d.items():
if len(vals) == len(files):
for var in itertools.product(*vals.values()):
if max(var) - min(var) <= 20:
out = '{}\t{}'.format(name, "\t".join(map(str, var)))
print(out)
break
output.csv
C(2)-C(1) 1.5183 1.5052
C(3)-C(2) 1.49 1.505
C(3)-C(1) 1.4991 1.5037
C(2)-C(1)-C(3) 59.19 60.3
C(3)-C(1)-H(5) 118.4 116.99
but I do not find an idea to print also the different lines.
my desired output:
similar_lines
C(2)-C(1) 1.5183 1.5052
C(3)-C(2) 1.49 1.505
C(3)-C(1) 1.4991 1.5037
C(2)-C(1)-C(3) 59.19 60.3
C(3)-C(1)-H(5) 118.4 116.99
different_lines
O(4)-C(3) 1.4104 non
H(10)-O(4) 0.964 non
S(4)-C(3) non 1.7976
H(10)-S(4) non 1.3445
C(2)-C(1)-H(6) non 117.68
Upvotes: 2
Views: 1828
Reputation: 71471
You can use itertools.groupby
:
import itertools, csv
file1 = [i+[True] for i in list(csv.reader(open('filename1.csv')))]
file2 = [i+[False] for i in list(csv.reader(open('filename2.csv')))]
new_data = [[a, list(b)] for a, b in itertools.groupby(sorted(file1+file2, key=lambda x:x[0]), key=lambda x:x[0])]
similar = ['{} {}'.format(a, ' '.join(h for _, h, flag in b)) for a, b in new_data if len(b) > 1]
different = ['{} {}'.format(a, 'non {}'.format(b[0][1]) if not b[0][-1] else '{} non'.format(b[0][1])) for a, b in new_data if len(b) == 1]
last_output = 'similar_lines\n{}\n\ndifferent_lines\n{}'.format('\n'.join(similar), '\n'.join(different))
Output:
similar_lines
C (2)-C(1) 1.5183 1.5052
C (2)-C(1)-C(3) 59.19 60.3
C (3)-C(1) 1.4991 1.5037
C (3)-C(1)-H(5) 118.4 116.99
C (3)-C(2) 1.49 1.505
different_lines
C (2)-C(1)-H(6) non 117.68
H (10)-O(4) 0.964 non
H (10)-S(4) non 1.3445
O (4)-C(3) 1.4104 non
S (4)-C(3) non 1.7976
Upvotes: 1