Reputation: 1151
I have a tab-delimited file with 100k lines (edited):
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
I want to get IDs that match in both directions and have the 1A-1B / 1B-1A pattern, save those in one file and the rest in another, so:
out.txt:
PROT1B2 PROT1A1
PROT1A1 PROT1B2
rest.txt:
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
My script gives me the bidirectional IDs, but I don't know how to filter for the specific pattern. With re? I'd appreciate comments in your script, so I can understand and modify it.
fileA = open("input.txt", 'r')
fileB = open("input_copy.txt", 'r')
output = open("out.txt", 'w')
out2 = open("rest.txt", 'w')

dictA = dict()
for line1 in fileA:
    new_list = line1.rstrip('\n').split('\t')
    query = new_list[0]
    subject = new_list[1]
    dictA[query] = subject

dictB = dict()
for line1 in fileB:
    new_list = line1.rstrip('\n').split('\t')
    query = new_list[0]
    subject = new_list[1]
    dictB[query] = subject

SharedPairs = {}
NotSharedPairs = {}
for id1 in dictA.keys():
    value1 = dictA[id1]
    if value1 in dictB.keys():
        if id1 == dictB[value1]:  # maybe re should go here?
            SharedPairs[value1] = id1
        else:
            NotSharedPairs[value1] = id1

for key in SharedPairs.keys():
    line = key + '\t' + SharedPairs[key] + '\n'
    output.write(line)

for key in NotSharedPairs.keys():
    line = key + '\t' + NotSharedPairs[key] + '\n'
    out2.write(line)
Upvotes: 1
Views: 129
Reputation: 3345
For the final specification in the question, here is a suggested answer:
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in a context-secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match = sorted(('1A', '1B'))  # accept only unordered 1A-1B pairs
fix_pos_slice = slice(4, 6)  # sample: 'PROT1A1' has '1A' on slice(4, 6)
shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Prepare a candidate for matching against the unordered pair
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    if must_match == fix_pos_match_cand and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into the corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # again secured in a context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
Operating on the given (new) input.txt (expecting tabs between the words in a line!):
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
yields in out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
and in rest.txt:
PROT1B6 PROT1A5
PROT2B2 PROT2A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2
Comments have been added to highlight some code portions.
To reuse the same input file but demonstrate a different result, also accept a hypothetical 1A-3B pair in the allowed matches (a bidirectional 2A-2B pair is not present in the input). One solution:
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in a context-secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            a_dict[pair[0]] = pair[1]
            b_dict[pair[1]] = pair[0]

must_match_once = sorted(  # accept only unordered 1A-1B or 1A-3B pairs
    (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
fix_pos_slice = slice(4, 6)  # sample: 'PROT1A1' has '1A' on slice(4, 6)
shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    # Prepare a candidate for matching against the unordered pairs
    fix_pos_match_cand = sorted(x[fix_pos_slice] for x in (key_a, val_a))
    has_match = any(
        must_match == fix_pos_match_cand for must_match in must_match_once)
    if has_match and b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

# Output shared and rest into the corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # again secured in a context block
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
Operating on the given (identical) input.txt (note: still expecting tabs between the words in a line!):
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
yields in out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
PROT3B2 PROT1A2
PROT1A2 PROT3B2
and in rest.txt:
PROT1B6 PROT1A5
PROT2B2 PROT2A1
As usual, real life brings in "duplicates", so upon special request by the OP, here is one final variant, dealing with duplicate first- or second-column tokens.
One could have also just stored the full pair as string in a set, but keeping the dict on input (with the nice setdefault method and targeting a list of values) is a more adequate approach IMO.
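To illustrate the design choice just mentioned, here is a minimal sketch contrasting the two storage options (the sample lines are the duplicate data from below; names are made up for the example):

```python
lines = ["PROT1A1\tPROT1B1", "PROT1A1\tPROT2B1"]

# Option 1: store each full pair as a single string in a set
pair_set = set()
for line in lines:
    pair_set.add(line)

# Option 2: dict built with setdefault, targeting a list of values,
# so a repeated first token keeps all of its second tokens
pair_dict = {}
for line in lines:
    first, second = line.split('\t')
    pair_dict.setdefault(first, []).append(second)

print(pair_dict)  # {'PROT1A1': ['PROT1B1', 'PROT2B1']}
```

The dict-of-lists variant keeps the data addressable by the first token, which is what the bidirectional lookup below needs.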
The variant below changes two things compared with the earlier ones: the input dicts now map to lists of values (so duplicates survive the read), and the output is collected in lists of pairs to preserve order. Sample of (partially) duplicate data dealt with:
PROT1A1 PROT1B1
PROT1A1 PROT2B1
Source code:
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
with open(a_file_name, "rt") as i_f:  # read file in a context-secured block
    for line in i_f:
        line = line.rstrip()  # no trailing white space (includes \n et al.)
        if line:
            pair = [a.strip() for a in line.split('\t')]  # fragile split
            # Build dicts with lists as values
            # ... to keep same-key, different-value pairs
            a_dict.setdefault(pair[0], []).append(pair[1])
            b_dict.setdefault(pair[1], []).append(pair[0])

must_match_once = sorted(  # accept only unordered 1A-1B or 1A-3B pairs
    (sorted(pair) for pair in (('1A', '1B'), ('1A', '3B'))))
fix_pos_slice = slice(4, 6)  # sample: 'PROT1A1' has '1A' on slice(4, 6)
shared, rest = [], []  # store respective output in lists of pairs (tuples)
for key_a, seq_val_a in a_dict.items():
    for val_a in seq_val_a:
        # Prepare a candidate for matching against the unordered pairs
        fix_pos_mc = sorted(x[fix_pos_slice] for x in (key_a, val_a))
        has_match = any(
            must_match == fix_pos_mc for must_match in must_match_once)
        if has_match and val_a in b_dict.get(key_a, ()):
            # Preserve first, second source order by appending in that order
            shared.append((key_a, val_a))
        else:
            rest.append((key_a, val_a))

# Output shared and rest into the corresponding files
for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:  # again secured in a context block
        for pair in data:
            o_f.write('\t'.join(pair) + '\n')
Operating on the given (degenerate) input.txt (note: still expecting tabs between the words in a line!):
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A1 PROT2B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
yields in out.txt:
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A2 PROT3B2
PROT3B2 PROT1A2
and in rest.txt:
PROT1A1 PROT2B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
Thus the "duplicate" PROT1A1 PROT2B2 is preserved and does not shadow the sought PROT1A1 PROT1B2.
Previous update:
Now with the better-specified task from the comment (maybe update the question as well):
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
fix_pos_match_query = sorted(('1A', '1B'))
fix_pos_slice = slice(4, 6)  # sample: 'PROT1A1' has '1A' on slice(4, 6)
with open(a_file_name, "rt") as i_f:
    for line in i_f:
        line = line.rstrip('\n')
        if line:
            pair = [a.strip() for a in line.split('\t')]
            fix_pos_match_cand = sorted(x[fix_pos_slice] for x in pair)
            if fix_pos_match_query == fix_pos_match_cand:
                a_dict[pair[0]] = pair[1]
                b_dict[pair[1]] = pair[0]

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    if b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
Operating on the given input.txt:
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
yields in out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
and in rest.txt:
PROT1B6 PROT1A5
This works by expecting the given 2 characters on the exact positions as understood from the samples and the comment.
If this is a protein-domain-specific task with further variations (e.g. the match is not at a fixed position), a regex would surely be better, but the question does not describe such variability. In case it is needed, simply replace the filter applied to each input line with one based on matching (or not matching) a regex.
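Such a regex-based filter could be sketched like this (the pattern is an assumption that generalizes the fixed-position slice; adapt it to the real ID grammar):

```python
import re

# Assumed ID shape: 'PROT' prefix, then digits + one letter (e.g. '1A'),
# matched by position-independent parsing instead of slice(4, 6)
token_re = re.compile(r'^PROT(\d+[A-Z])')

def unordered_tokens(pair):
    """Extract the '1A'-style tokens from both IDs, order-insensitive."""
    return sorted(token_re.match(p).group(1) for p in pair)

must_match = sorted(('1A', '1B'))
print(unordered_tokens(('PROT1B2', 'PROT1A1')) == must_match)  # True
print(unordered_tokens(('PROT2A1', 'PROT2B2')) == must_match)  # False
```

Plugged into the loop above, `unordered_tokens((key_a, val_a))` would replace the `fix_pos_match_cand` computation.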
Old:
First answer with a simple full-match filter:
A trial for an answer, or at least some hints on how to write the code in a more PEP 8-readable way for others, with a simple filter based on the requested existence of a specific pair on input:
a_file_name = "input.txt"
a_dict, b_dict = {}, {}
filter_on_read = sorted(('PROT1B2', 'PROT1A1'))
with open(a_file_name, "rt") as i_f:
    for line in i_f:
        line = line.rstrip('\n')
        if line:
            pair = [a.strip() for a in line.split('\t')]
            if filter_on_read == sorted(pair):
                a_dict[pair[0]] = pair[1]
                b_dict[pair[1]] = pair[0]

shared, rest = {}, {}
for key_a, val_a in a_dict.items():
    if b_dict.get(key_a) == val_a:
        shared[val_a] = key_a
    else:
        rest[val_a] = key_a

for f_name, data in (("out.txt", shared), ("rest.txt", rest)):
    with open(f_name, 'wt') as o_f:
        for key, val in data.items():
            o_f.write(key + '\t' + val + '\n')
On my machine, given:
PROT1B2 PROT1A1
PROT1A1 PROT1B2
PROT1A5 PROT1B6
PROT2A1 PROT2B2
yields in out.txt:
PROT1A1 PROT1B2
PROT1B2 PROT1A1
and rest.txt contains nothing (for this input), as the rest was filtered out already on read.
Please note: there will be more elegant versions, especially when reading huge files. We might also miss data: like the question's code, we only loop over one side of the mapping (first entry to second entry), so if there are duplicate first entries with differing second entries, the last read overrides the previous ones, and the earlier pairs are never output.
So there may be entries in b_dict that you will never see in the output files.
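A tiny demonstration of that override behavior (sample tokens made up for the example):

```python
# Duplicate first-column tokens: a plain dict keeps only the last value
lines = ["PROT1A1\tPROT1B1", "PROT1A1\tPROT2B1"]
a_dict = {}
for line in lines:
    first, second = line.split('\t')
    a_dict[first] = second  # second assignment overwrites the first

print(a_dict)  # {'PROT1A1': 'PROT2B1'} - the PROT1B1 pairing is lost
```

This is exactly what the list-valued setdefault variant above avoids.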
HTH. Python is a wonderful language ;-)
Upvotes: 1