user3224522

Reputation: 1151

Find shared IDs with specific pattern

I have a tab-delimited file with 100k lines (EDITED):

PROT1B2 PROT1A1
PROT1A1 PROT1B2  
PROT1A5 PROT1B6   
PROT2A1 PROT2B2   
PROT1A2 PROT3B2
PROT3B2 PROT1A2

I want to get IDs that match in both directions and have the 1A-1B / 1B-1A pattern, save those in one file and the rest in another, so:

out.txt)  PROT1B2 PROT1A1
          PROT1A1 PROT1B2

rest.txt) PROT1A5 PROT1B6
          PROT2A1 PROT2B2
          PROT1A2 PROT3B2
          PROT3B2 PROT1A2

My script gives me bidirectional IDs, but I don't know how to filter for the specific pattern; should I use re? I would appreciate it if you commented your script, so I can understand and modify it.

fileA = open("input.txt",'r')
fileB = open("input_copy.txt",'r')
output = open("out.txt",'w')
out2=open("rest.txt",'w')
dictA = dict()
for line1 in fileA:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    dictA[query] = subject
dictB = dict()
for line1 in fileB:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    dictB[query] = subject
SharedPairs ={}
NotSharedPairs ={}
for id1 in dictA.keys():
    value1=dictA[id1]
    if value1 in dictB.keys():
        if id1 == dictB[value1]: # may be re should go here?
            SharedPairs[value1] = id1
        else:
            NotSharedPairs[value1] = id1
for key in SharedPairs.keys():
    line = key +'\t' + SharedPairs[key]+'\n'
    output.write(line)
for key in NotSharedPairs.keys():
    line = key +'\t' + NotSharedPairs[key]+'\n'
    out2.write(line)

Upvotes: 1

Views: 129

Answers (1)

Dilettant

Reputation: 3345

For the final specification in the question, here is the suggested answer:

a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match = sorted(('1A', '1B'))  # Accept only unordered 1A-1B pairs
fix_pos_slice = slice(4, 6)  # Sample: 'PROT1A1' has '1A' on slice(4, 6)

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Below we prepare a candidate for matching against the unordered pair
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    if must_match == fix_pos_match_cand and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # Again secured in context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')

Operating on the given (new) input.txt (expecting tabs between the words in a line!):

PROT1B2 PROT1A1
PROT1A1 PROT1B2  
PROT1A5 PROT1B6   
PROT2A1 PROT2B2   
PROT1A2 PROT3B2
PROT3B2 PROT1A2

yields in out.txt:

PROT1A1 PROT1B2
PROT1B2 PROT1A1

and in rest.txt:

PROT1B6 PROT1A5
PROT2B2 PROT2A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2

Comments have been added to highlight some code portions.

Upon special request, here is a sketch of how to extend the matching to more pairs:

To use the same input file but demonstrate a different result, add a hypothetical 1A-3B pair to the allowed matches (adding 2A-2B would not change anything, since that pair does not appear bidirectionally in the input). One solution goes like this:

a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match_once = sorted(  # Accept only unordered 1A-1B or 1A-3B pairs
    (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
fix_pos_slice = slice(4, 6)  # Sample: 'PROT1A1' has '1A' on slice(4, 6)

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Below we prepare a candidate for matching against the unordered pair
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    has_match = any(
        [must_match == fix_pos_match_cand for must_match in must_match_once])
    if has_match and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # Again secured in context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')

Operating on the given (identical) input.txt (note still expecting tabs between the words in a line!):

PROT1B2 PROT1A1
PROT1A1 PROT1B2  
PROT1A5 PROT1B6   
PROT2A1 PROT2B2   
PROT1A2 PROT3B2
PROT3B2 PROT1A2

yields in out.txt:

PROT1A1 PROT1B2
PROT1B2 PROT1A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2

and in rest.txt:

PROT1B6 PROT1A5
PROT2B2 PROT2A1

Upon very special request, here is a sketch of how to also avoid shadowing partially duplicate pairs:

As usual, real life brings in "duplicates", so upon special request by the OP, here is one final variant that deals with duplicate first- or second-column tokens.

One could also have stored each full pair as a string in a set, but keeping the dicts on input (with the nice setdefault method and a list of values per key) is the more adequate approach IMO.
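For comparison, a rough sketch of that set-based alternative (just a sketch, variable names chosen freely) could look like this:

# Rough sketch of the set-of-pairs alternative mentioned above (illustrative only):
# store every (first, second) tuple, then test bidirectionality by lookup.
pairs = set()
with open("input.txt", "rt") as i_f:
    for line in i_f:
        line = line.rstrip()
        if line:
            first, second = (a.strip() for a in line.split('\t'))
            pairs.add((first, second))

must_match = sorted(('1A', '1B'))
fix_pos_slice = slice(4, 6)

shared, rest = [], []
for first, second in pairs:
    tags = sorted(x[fix_pos_slice] for x in (first, second))
    # A pair is "shared" if its tags fit and its reverse was also seen
    if tags == must_match and (second, first) in pairs:
        shared.append((first, second))
    else:
        rest.append((first, second))

Note that a set does not preserve the input order of the pairs, which is one more reason to prefer the dict-of-lists variant below.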

In the variant below, two other things are changed compared to the previous variants:

  1. The output is collected as pairs (tuples) appended to lists
  2. The output now preserves the source column order (the previous variants wrote the columns the other way around)

Sample (partially) duplicate data dealt with:

PROT1A1 PROT1B1
PROT1A1 PROT2B1

Source code:

a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in context secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            # Build a dict with list as values
            # ... to keep same key, different value pairs
            a_dict.setdefault(pair[0], []).append(pair[1])
            b_dict.setdefault(pair[1], []).append(pair[0])

must_match_once = sorted(  # Accept only unordered 1A-1B or 1A-3B pairs
    (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
fix_pos_slice = slice(4, 6)  # Sample: 'PROT1A1' has '1A' on slice(4, 6)

shared, rest = [], []  # Store respective output in lists of pairs (tuples)
for key_a, seq_val_a in a_dict.items():
    for val_a in seq_val_a:
        # Below we prepare a candidate for matching against the unordered pair
        fix_pos_mc = sorted(x[fix_pos_slice] for x in (key_a, val_a))
        has_match = any(
            [must_match == fix_pos_mc for must_match in must_match_once])
        if has_match and b_dict.get(key_a) and val_a in b_dict.get(key_a):
            # Preserve first, second source order by appending in that order
            shared.append((key_a, val_a))
        else:
            rest.append((key_a, val_a))

# Output shared and rest into corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # Again secured in context block
        for pair in data:
            o_f.write('\t'.join(pair) + '\n')

Operating on the given (degenerate) input.txt (note still expecting tabs between the words in a line!):

PROT1B2 PROT1A1
PROT1A1 PROT1B2  
PROT1A1 PROT2B2  
PROT1A5 PROT1B6   
PROT2A1 PROT2B2   
PROT1A2 PROT3B2
PROT3B2 PROT1A2

yields in out.txt:

PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2

and in rest.txt:

PROT1A1 PROT2B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2

Thus the "duplicate" PROT1A1 PROT2B2 is preserved and does not shadow the sought PROT1A1 PROT1B2.

Previous update:

Now with the task better specified in a comment (maybe also update the question):

a_file_name = "input.txt"
a_dict, b_dict = {}, {}
fix_pos_match_query = sorted(('1A', '1B'))
fix_pos_slice = slice(4, 6)  # sample 'PROT1A1' has '1A' on slice(4, 6)
with open(a_file_name, "rt") as i_f:
    for line in i_f:
        line = line.rstrip('\n')
        if line:
            pair = [a.strip() for a in line.split('\t')]
            fix_pos_match_cand = sorted(x[fix_pos_slice] for x in pair)
            if fix_pos_match_query == fix_pos_match_cand:
                a_dict[pair[0]] = pair[1]
                b_dict[pair[1]] = pair[0]

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    if b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')

Operating on the given input.txt:

PROT1B2 PROT1A1
PROT1A1 PROT1B2 
PROT1A5 PROT1B6  
PROT2A1 PROT2B2

yields in out.txt:

PROT1A1 PROT1B2
PROT1B2 PROT1A1

and in rest.txt:

PROT1B6 PROT1A5

This works by expecting the given two characters at the exact positions understood from the samples and the comment.

If this is a protein-domain-specific task with further variations (e.g. the tag not sitting at a fixed position), a regex would surely be the better fit, but the question does not show such variability. In case it is needed, simply replace the per-line filter with one based on whether a regex matches (or does not match), as sketched below.
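For illustration, a rough sketch of such a regex-based filter (the pattern below is an assumption on my side: 'PROT', a digit-letter tag, a trailing index - adjust it to the real ID scheme):

import re

# Assumed ID scheme: 'PROT' prefix, a digit+letter tag, then a trailing index.
TAG_RE = re.compile(r'^PROT(\d[A-Z])\d+$')
wanted_tags = {'1A', '1B'}

def tag_of(token):
    """Return the tag (e.g. '1A') of a token, or None if it does not match."""
    m = TAG_RE.match(token)
    return m.group(1) if m else None

def is_wanted_pair(first, second):
    """True if both tokens carry tags and the tags form the wanted 1A/1B pair."""
    return {tag_of(first), tag_of(second)} == wanted_tags

# is_wanted_pair('PROT1B2', 'PROT1A1')  -> True
# is_wanted_pair('PROT1A5', 'PROT2B2')  -> False

Such a helper would replace the must_match == fix_pos_match_cand test above, keeping the rest of the flow unchanged.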

Old:

First answer, with a simple full-match filter:

An attempt at an answer - or at least some hints on how to write the code in a more PEP 8-like, readable way for others - with a simple filter based on the requested existence of a specific pair applied on input:

a_file_name = "input.txt"
a_dict, b_dict = {}, {}
filter_on_read = sorted(('PROT1B2', 'PROT1A1'))
with open(a_file_name, "rt") as i_f:
    for line in i_f:
        line = line.rstrip('\n')
        if line:
            pair = [a.strip() for a in line.split('\t')]
            if filter_on_read == sorted(pair):
                a_dict[pair[0]] = pair[1]
                b_dict[pair[1]] = pair[0]

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    if b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')

On my machine, given:

PROT1B2 PROT1A1
PROT1A1 PROT1B2 
PROT1A5 PROT1B6  
PROT2A1 PROT2B2

yields in out.txt:

PROT1A1 PROT1B2
PROT1B2 PROT1A1

and rest.txt stays empty (for this input), as the rest was already filtered out on read.

Please note: There will be more elegant versions, especially when reading huge files ... and we might miss data: like the question's code, we only loop over one side of the mapping (a map from first to second entry). If there are duplicate first entries with differing second entries, the last one overrides the data from the previous read, and the former will never be output.

So there may be entries in b_dict that you will never see in the output files.
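A tiny illustration of that override behaviour (hypothetical IDs, for demonstration only):

# A plain dict keeps only the last value for a repeated first-column key:
a_dict = {}
a_dict['PROT1A1'] = 'PROT1B1'
a_dict['PROT1A1'] = 'PROT2B1'   # overrides the previous entry
print(a_dict)                   # {'PROT1A1': 'PROT2B1'}

# The setdefault/list variant shown further up keeps both:
a_dict = {}
a_dict.setdefault('PROT1A1', []).append('PROT1B1')
a_dict.setdefault('PROT1A1', []).append('PROT2B1')
print(a_dict)                   # {'PROT1A1': ['PROT1B1', 'PROT2B1']}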

HTH, Python is a wonderful language ;-)

Upvotes: 1
