Reputation: 5059
I have multiple files (say, 3). Each file has only one column and looks like the following:
File A
America
Russia
China
UK
File B
India
China
Russia
File C
China
America
Russia
Iran
Now, to compute the intersection, i.e. the elements common to all files, I do
python -c 'import sys;print "".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]])))' File1 File2 File3 File4
But if I also need the pairwise overlap between these files, how can I loop the process so that I get the set of elements present in all of them and also the elements present in A&B, A&C and B&C?
Help in Python would be appreciated.
Upvotes: 0
Views: 1019
Reputation: 32189
You can simply use set for that:
>>> print list(set(open(f1)) & set(open(f2)) & set(open(f3)))
For specific files, you can do:
>>> print list(set(open(f1)) & set(open(f2)))
>>> print list(set(open(f1)) & set(open(f3)))
>>> print list(set(open(f2)) & set(open(f3)))
As per @HerrActress's suggestion, this will take care of the \n part of the string:
[i.strip() for i in (set(open(f1)) & set(open(f2)))]
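One thing to watch for: stripping after the intersection can still miss matches when a file's last line has no trailing newline, because "China" and "China\n" are different strings. A minimal sketch of stripping before building the sets, written for Python 3 and using made-up in-memory contents (text_a, text_b and both helper names are hypothetical, not from the question):

```python
import io

# Hypothetical file contents; text_b has no newline after its last line.
text_a = "America\nRussia\nChina\nUK\n"
text_b = "India\nRussia\nChina"

def raw_lines(text):
    # Mimics set(open(path)): each element keeps its trailing "\n", if any.
    return set(io.StringIO(text))

def clean_lines(text):
    # Strip whitespace before building the set and skip blank lines.
    return {line.strip() for line in io.StringIO(text) if line.strip()}

raw_common = raw_lines(text_a) & raw_lines(text_b)
clean_common = clean_lines(text_a) & clean_lines(text_b)

print(sorted(raw_common))    # ['Russia\n']
print(sorted(clean_common))  # ['China', 'Russia']
```

With real files, the same set comprehension can be applied to the handle returned by open() inside a with block, which also closes the file.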
Upvotes: 1
Reputation: 250941
To get lines that are common to all files you can use:
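import sys

data = []  # one set of lines per file; must be created once, before the loop
for f in sys.argv[1:]:
    with open(f) as inp:
        # Strip trailing newlines so identical names compare equal
        lines = set(line.rstrip() for line in inp)
    data.append(lines)
# Lines present in every file
common_lines = data[0].intersection(*data[1:])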
For the second part use itertools.combinations:
import sys
from itertools import combinations

for f1, f2 in combinations(sys.argv[1:], 2):
    with open(f1) as inp1, open(f2) as inp2:
        # Lines shared by this pair of files
        print(set(line.rstrip() for line in inp1).intersection(map(str.rstrip, inp2)))
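If you want both results at once and labelled by file, something along these lines might work. It is only a sketch for Python 3: read_lines and overlaps are hypothetical helper names, and the demo uses the sample data from the question as in-memory sets rather than real files.

```python
from itertools import combinations

def read_lines(path):
    # Helper for real files: stripped, non-empty lines as a set.
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}

def overlaps(named_sets):
    # Elements present in every set.
    common = set.intersection(*named_sets.values())
    # Elements shared by each unordered pair of names.
    pairwise = {
        (a, b): named_sets[a] & named_sets[b]
        for a, b in combinations(sorted(named_sets), 2)
    }
    return common, pairwise

# Sample data from the question; with real files you could build this as
# {path: read_lines(path) for path in sys.argv[1:]} instead.
files = {
    "A": {"America", "Russia", "China", "UK"},
    "B": {"India", "China", "Russia"},
    "C": {"China", "America", "Russia", "Iran"},
}

common, pairwise = overlaps(files)
print(sorted(common))  # ['China', 'Russia']
for pair, shared in sorted(pairwise.items()):
    print(pair, sorted(shared))
# ('A', 'B') ['China', 'Russia']
# ('A', 'C') ['America', 'China', 'Russia']
# ('B', 'C') ['China', 'Russia']
```

Reading each file once into a set and reusing it avoids reopening every file for every pair, which the loop above does.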
Upvotes: 2