Find bi-directional matching IDs

Question

My input file,

ID1 ID2 value
ID3 ID6 value  
ID2 ID1 value
ID4 ID5 value
ID6 ID5 value
ID5 ID4 value
ID7 ID2 value

Desired output, file1.txt

ID1 ID2 value   ID2 ID1 value
ID4 ID5 value   ID5 ID4 value

file2.txt

ID3 ID6 value   
ID6 ID5 value
ID7 ID2 value

I am trying to find bi-directional best matches. If ID1 has ID2 as a hit and ID2 also has ID1 as a hit, the pair should be printed to file1; otherwise it goes to file2. I tried making a copy of the input file and building a dictionary from each, but this produces output without the values (10 columns). How can I modify it?

fileA = open("input.txt",'r')
fileB = open("input_copy.txt",'r')
output = open("out.txt",'w')
output2 = open("out2.txt",'w')  # assumed file name; output2 is used below but was never opened

dictA = dict()
for line1 in fileA:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    dictA[query] = subject
dictB = dict()
for line1 in fileB:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    dictB[query] = subject
SharedPairs ={}
NotSharedPairs ={}
for id1 in dictA.keys():
    value1=dictA[id1]
    if value1 in dictB.keys():
        if id1 == dictB[value1]:
            SharedPairs[value1] = id1
        else:
            NotSharedPairs[value1] = id1
for key in SharedPairs.keys():
    line = key +'\t' + SharedPairs[key]+'\n'
    output.write(line)
for key in NotSharedPairs.keys():
    line = key +'\t' + NotSharedPairs[key]+'\n'
    output2.write(line)

Reut Sharabani · Accepted Answer

You can use sets to solve it easily:

#!/usr/bin/env python

# ordered pairs (ID1, ID2)
oset = set()
# reversed pairs (ID2, ID1)
rset = set()

with open('input.txt') as f:
    for line in f:
        first, second, val = line.strip().split()
        if first < second:
            oset.add((first, second, val,))
        else:
            # note that this reverses second and first for matching purposes
            rset.add((second, first, val,))

print("common: %s" % str(oset & rset))
print("diff: %s" % str(oset ^ rset))

Output:

common: set([('ID4', 'ID5', 'value'), ('ID1', 'ID2', 'value')])
diff: set([('ID3', 'ID6', 'value'), ('ID5', 'ID6', 'value'), ('ID2', 'ID7', 'value')])

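It doesn't treat self-pairs such as (ID1, ID1) specially; you could add those to a third set and handle them however you decide. Note that the value is part of each tuple, so the two directions only count as a match when their values are identical.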
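
To get the two files in the layout the question asks for, with every column kept, a dictionary keyed on the (first, second) pair is one option. This is a minimal sketch rather than part of the code above: the function names split_reciprocal and write_outputs are hypothetical, and it assumes whitespace-separated rows whose first two columns are the IDs. Self-pairs such as (ID1, ID1) are simply treated as unmatched here.

```python
def split_reciprocal(lines):
    """Return (matched, unmatched) rows, keeping every column of each row.

    Assumption: rows are whitespace-separated and the first two columns
    are the query and subject IDs.
    """
    rows = {}    # (query, subject) -> original row text; a repeated pair overwrites
    order = []   # pairs in order of first appearance
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        key = (fields[0], fields[1])
        if key not in rows:
            order.append(key)
        rows[key] = line.strip()

    matched, unmatched, done = [], [], set()
    for a, b in order:
        if (a, b) in done:
            continue  # already written as the second half of a pair
        if a != b and (b, a) in rows:
            # reciprocal hit: put both rows side by side, tab-separated
            matched.append(rows[(a, b)] + "\t" + rows[(b, a)])
            done.add((b, a))
        else:
            unmatched.append(rows[(a, b)])
    return matched, unmatched


def write_outputs(in_path, matched_path, unmatched_path):
    """Read in_path and write the matched and unmatched rows to two files."""
    with open(in_path) as src:
        matched, unmatched = split_reciprocal(src)
    with open(matched_path, "w") as out:
        out.writelines(row + "\n" for row in matched)
    with open(unmatched_path, "w") as out:
        out.writelines(row + "\n" for row in unmatched)
```

Calling write_outputs('input.txt', 'file1.txt', 'file2.txt') on the sample input puts the two reciprocal pairs in file1.txt and the three remaining rows in file2.txt. Because the lookup uses only the two IDs, the value columns are free to differ between the two directions.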
