Angelo

Reputation: 5059

Intersection between multiple files

I have multiple files (say, 3). Each file has only one column and looks like the following:

File A

America
Russia
China
UK

File B

India
China
Russia

File C

China
America
Russia
Iran

Now, to compute the intersection, i.e. to get the elements common to all files, I run:

python -c 'import sys;print "".join(sorted(set.intersection(*[set(open(a).readlines()) for a in sys.argv[1:]])))' File1 File2 File3 File4

But if I also need to know the pairwise overlap between these files, how can I loop the process so that I get the set of elements present in all of them and also the elements present in A&B, A&C, and B&C?

Help in Python would be appreciated.

Upvotes: 0

Views: 1019

Answers (2)

sshashank124

Reputation: 32189

You can simply use set for that:

>>> print(list(set(open(f1)) & set(open(f2)) & set(open(f3))))

For specific pairs of files, you can do:

>>> print(list(set(open(f1)) & set(open(f2))))
>>> print(list(set(open(f1)) & set(open(f3))))
>>> print(list(set(open(f2)) & set(open(f3))))

As per @HerrActress's suggestion, this strips the trailing \n from each common line:

[i.strip() for i in (set(open(f1)) & set(open(f2)))]
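One caveat: stripping after the intersection can still miss a match when the last line of one file has no trailing newline ("UK" versus "UK\n"). A minimal sketch that strips each line before building the sets might look like this; `read_items` and `common_items` are hypothetical helper names, not part of the original answer, and skipping blank lines is an assumption about the desired behavior.

```python
def read_items(path):
    """Return the set of non-empty lines in a file, stripped of whitespace."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def common_items(*paths):
    """Return the sorted items that appear in every given file.

    Assumes at least one path is passed.
    """
    return sorted(set.intersection(*(read_items(p) for p in paths)))
```

With the three sample files from the question, `common_items("A", "B", "C")` should give the entries shared by all of them, regardless of whether each file ends with a newline.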

Upvotes: 1

Ashwini Chaudhary

Reputation: 250941

To get lines that are common to all files you can use:

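import sys

data = []  # one set of lines per file; must be created once, before the loop
for f in sys.argv[1:]:
    with open(f) as inp:
        # Strip trailing whitespace so "China\n" and "China" compare equal.
        lines = set(line.rstrip() for line in inp)
        data.append(lines)
# After all files are read, intersect the first set with the rest.
common_lines = data[0].intersection(*data[1:])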

For the second part use itertools.combinations:

import sys
from itertools import combinations

for f1, f2 in combinations(sys.argv[1:], 2):
    with open(f1) as inp1, open(f2) as inp2:
        # Lines common to this pair of files, with trailing whitespace stripped.
        print(set(line.rstrip() for line in inp1).intersection(map(str.rstrip, inp2)))
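Both parts can also be combined so that each file is read only once and every pairwise result is labeled with the files it came from. This is only a sketch under stated assumptions: `load` and `overlaps` are hypothetical names introduced here, and at least one path is assumed to be given.

```python
from itertools import combinations


def load(path):
    """Read a one-column file into a set of right-stripped lines."""
    with open(path) as inp:
        return {line.rstrip() for line in inp}


def overlaps(paths):
    """Return (items common to all files, {(f1, f2): items shared by that pair})."""
    data = {path: load(path) for path in paths}  # each file is read once
    common = set.intersection(*data.values())
    pairwise = {
        (f1, f2): data[f1] & data[f2]
        for f1, f2 in combinations(paths, 2)
    }
    return common, pairwise
```

From a script, calling `overlaps(sys.argv[1:])` and iterating over the returned dictionary gives one labeled set per pair (A&B, A&C, B&C for three files).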

Upvotes: 2
