Reputation: 1151
I have two files. The first is a list of proteins:
TRIUR3_05947-P1
TRIUR3_06394-P1
Traes_1BL_EB95F4919.2
The second is a tab-delimited "dictionary" file of contigs and proteins:
contig22 TRIUR3_05947-P1
contig15 TRIUR3_05947-P1
contig1 Traes_1BL_EB95F4919.2
contig67 Traes_1BL_EB95F4919.2
contig98 Traes_1BL_EB95F4919.2
contig45 MLOC_71599.4
I want to find every line of the dictionary file whose protein also appears in the list, and print the results like this:
contig22 TRIUR3_05947-P1
contig15 TRIUR3_05947-P1
contig1 Traes_1BL_EB95F4919.2
contig67 Traes_1BL_EB95F4919.2
contig98 Traes_1BL_EB95F4919.2
My script is below, but it outputs each common protein just once. I guess the earlier entries are being overwritten. How can this be solved?
f1=open('mydict.txt','r')
f2=open('mylist.txt','r')
output = open('result.txt','w')
dictA= dict()
for line1 in f1:
    listA = line1.rstrip('\r\n').split('\t')
    dictA[listA[1]] = listA[0]
for line1 in f2:
    new_list=line1.rstrip('\n').split()
    query=new_list[0]
    if query in dictA:
        listA[0] = dictA[query]
        output.write(query+'\t'+str(listA[0])+'\n')
Upvotes: 1
Views: 701
Reputation: 1128
You are doing this the wrong way around. If you store the 'dictionary file' in a dict keyed by protein name, you will lose information, because each key can hold only one value and later contigs overwrite earlier ones.
A better way is to read the list of proteins first and store all the protein names in a set. Then read the dictionary file and write out every line whose protein name is in the set.
with open('mylist.txt') as mylist:
    proteins = set(line.strip() for line in mylist)
with open('mydict.txt') as mydict, open('result.txt', 'w') as output:
    for line in mydict:
        _, protein = line.strip().split()
        if protein in proteins:
            output.write(line)
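The same idea can be sketched as a function over in-memory lines, which makes it easy to try without any files. The name filter_by_protein and the sample data are only illustrative, and skipping blank or malformed lines is an assumption of this sketch that the snippet above does not make.

```python
def filter_by_protein(dict_lines, wanted):
    """Return the lines of dict_lines whose second column is in wanted.

    filter_by_protein is a hypothetical helper name used for illustration.
    """
    wanted = set(wanted)  # set membership tests are O(1) on average
    kept = []
    for line in dict_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: silently skip blank or malformed lines
        _, protein = fields
        if protein in wanted:
            kept.append(line)  # keep the original line, in file order
    return kept


# Sample data copied from the question.
proteins = ["TRIUR3_05947-P1", "TRIUR3_06394-P1", "Traes_1BL_EB95F4919.2"]
dict_lines = [
    "contig22\tTRIUR3_05947-P1\n",
    "contig15\tTRIUR3_05947-P1\n",
    "contig1\tTraes_1BL_EB95F4919.2\n",
    "contig67\tTraes_1BL_EB95F4919.2\n",
    "contig98\tTraes_1BL_EB95F4919.2\n",
    "contig45\tMLOC_71599.4\n",
]

result = filter_by_protein(dict_lines, proteins)
print("".join(result), end="")
```

Because the dictionary file is streamed line by line, every matching contig is kept and the output stays in the same order as the file.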
Upvotes: 1
Reputation: 341
In your first for loop, you lose information as you turn the text file into a Python dictionary:
for ...:
    dictA[listA[1]] = listA[0]
For example, if you have the lines
contig1 Traes_1BL_EB95F4919.2
contig67 Traes_1BL_EB95F4919.2
contig98 Traes_1BL_EB95F4919.2
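in your text file, the resulting dictionary keeps only one entry for that protein: the key Traes_1BL_EB95F4919.2 mapped to the last contig, contig98. Each assignment to the same key overwrites the previous value.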
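A small sketch of that overwrite, using the three lines above as in-memory pairs instead of reading a file (the pairs list is only for illustration):

```python
pairs = [
    ("contig1", "Traes_1BL_EB95F4919.2"),
    ("contig67", "Traes_1BL_EB95F4919.2"),
    ("contig98", "Traes_1BL_EB95F4919.2"),
]

dictA = {}
for contig, protein in pairs:
    dictA[protein] = contig  # same key each time, so the old value is replaced

print(dictA)  # {'Traes_1BL_EB95F4919.2': 'contig98'}
```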
To achieve your goal with minimal changes to your program, try a defaultdict that maps each protein to a list of contigs:
from collections import defaultdict
f1=open('mydict.txt','r')
f2=open('mylist.txt','r')
output = open('result.txt','w')
dictA= defaultdict(list)
for line1 in f1:
    listA = line1.rstrip('\r\n').split('\t')
    dictA[listA[1]].append(listA[0]) # Keep every contig seen for this protein
for line1 in f2:
    new_list=line1.rstrip('\n').split()
    query=new_list[0]
    if query in dictA:
        listA = dictA[query] # Now have a list of matching contigs
        for contig in listA:
            output.write(contig + '\t' + query +'\n')
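To see what the grouping produces without touching any files, here is a rough in-memory sketch. The helper name contigs_by_protein, the sample queries, and the choice to skip malformed lines are assumptions made for illustration.

```python
from collections import defaultdict


def contigs_by_protein(dict_lines):
    """Group contig names by protein (hypothetical helper for illustration)."""
    groups = defaultdict(list)
    for line in dict_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 2:
            continue  # assumption: skip blank or malformed lines
        contig, protein = fields
        groups[protein].append(contig)  # append instead of overwrite
    return groups


dict_lines = [
    "contig22\tTRIUR3_05947-P1\n",
    "contig15\tTRIUR3_05947-P1\n",
    "contig1\tTraes_1BL_EB95F4919.2\n",
    "contig67\tTraes_1BL_EB95F4919.2\n",
    "contig98\tTraes_1BL_EB95F4919.2\n",
    "contig45\tMLOC_71599.4\n",
]
groups = contigs_by_protein(dict_lines)

rows = []
for query in ["Traes_1BL_EB95F4919.2", "TRIUR3_06394-P1", "TRIUR3_05947-P1"]:
    if query in groups:  # a membership test does not create a new key
        for contig in groups[query]:
            rows.append((contig, query))

for contig, query in rows:
    print(contig + "\t" + query)
```

Note that the rows come out in the order of the protein list, not the order of the dictionary file. If the file order matters, the set-based approach in the other answer preserves it.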
Upvotes: 1