Reputation: 1151
I have two big files. I want to find the names that appear both in column 1 of file1 and in column 2 of file2. The script below finds them. The problem: I also want to print the corresponding data from file1 in the output, but that part does not work. How can I fix it?
file1.txt
GRMZM5G888627_P01 GO:0003674 molecular_function
GRMZM5G888620_P01 GO:0008150 biological_process
GRMZM5G888625_P03 GO:0008152 metabolic process
file2.txt
contig1 GRMZM5G888627_P01
contig2 AT2G41790.1
contig3 GRMZM5G888625_P03
Desired output,
contig1 GRMZM5G888627_P01 GO:0003674 molecular_function
contig3 GRMZM5G888625_P03 GO:0008152 metabolic process
Script,
f1=open('file1.txt','r')
f2=open('file2.txt','r')
output = open('result.txt','w')
dictA= dict()
for line1 in f1:
    listA = line1.rstrip('\n').split('\t')
    dictA[listA[0]] = listA
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
        output.writelines(query+'\t'+subject+'\t'+str(listA[1])+str(listA[2])+'\n')
output.close()
Upvotes: 1
Views: 72
Reputation: 8692
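Try this:
ins = open('file1.txt', "r")
# Map each name (column 1 of file1) to its whole line, re-joined with tabs.
lookup = {}
for line in ins:
    arrayline = line.split()
    lookup[arrayline[0]] = '\t'.join(arrayline)
ins.close()
file2 = open('file2.txt', "r")
output = open('result.txt', 'w')
for line in file2:
    array2 = line.split()
    try:
        v = lookup[array2[1]]
        output.write(array2[0] + '\t' + v + '\n')
    except (KeyError, IndexError):
        # No match in file1, or a line with fewer than two columns: skip it.
        pass
file2.close()
output.close()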
Upvotes: 1
Reputation: 2437
Inside the for line1 in f2: loop, listA is not tied to the current f2 line; it still holds the last line read from file1. The rows themselves are stored in dictA, so once you have tested that subject is in dictA, you need to retrieve the proper listA from it:
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
        listA = dictA[subject]
        # '\t' between the last two fields keeps them in separate columns, as in the desired output
        output.writelines(query+'\t'+subject+'\t'+str(listA[1])+'\t'+str(listA[2])+'\n')
output.close()
I don't understand why you are appending to new_list here:
query=new_list[0]
subject=new_list[1]
new_list.append(query)
new_list.append(subject)
When processing the first line, you read in:
contig1 GRMZM5G888627_P01
into new_list, giving you essentially:
new_list == ['contig1', 'GRMZM5G888627_P01']
Then you set query and subject to the two items in the list and append them back onto it, giving you:
new_list == ['contig1', 'GRMZM5G888627_P01', 'contig1', 'GRMZM5G888627_P01']
which you never use. You should be able to just have:
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    subject=new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        output.writelines(new_list[0] + '\t' + subject + '\t' + str(listA[1]) + '\t' + str(listA[2]) + '\n')
output.close()
Also, you are only writing one line at a time, so output.write is fine. Building the line with repeated + gets hard to read, so I replaced it with format. Your listA already stores strings, so I dropped the str() calls.
for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    subject=new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        # a tab between the last two fields keeps them in separate columns, as in the desired output
        output.write("{}\t{}\t{}\t{}\n".format(new_list[0], subject, listA[1], listA[2]))
output.close()
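Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes both files are tab-separated, as the split('\t') calls in the question imply. The function names join_on_subject and write_joined are made up for illustration, and with open(...) stands in for the explicit open/close calls.

```python
def join_on_subject(file1_lines, file2_lines):
    """Yield file2 rows extended with the matching file1 columns.

    Assumes tab-separated input: the id is column 1 of file1
    and column 2 of file2.
    """
    # Index file1 by its first column so each lookup is a dict access.
    annotations = {}
    for line in file1_lines:
        fields = line.rstrip('\n').split('\t')
        if fields[0]:
            annotations[fields[0]] = fields

    for line in file2_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        if subject in annotations:
            # Fetch the row that belongs to this subject and append
            # its remaining columns after the two file2 columns.
            yield '\t'.join([query, subject] + annotations[subject][1:])


def write_joined(file1_path, file2_path, out_path):
    """Hypothetical wrapper: read both files and write the joined rows."""
    with open(file1_path) as f1, open(file2_path) as f2, \
            open(out_path, 'w') as output:
        for row in join_on_subject(f1, f2):
            output.write(row + '\n')
```

Calling write_joined('file1.txt', 'file2.txt', 'result.txt') would then produce the result file. Only the file1 rows are kept in memory, while file2 is streamed line by line, which should help if file2 is the larger of the two.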
Upvotes: 2
Reputation: 180401
Use sets to find the common names:
In [1]: list1=[1,2,3,4,5,6,7,8,9]
In [2]: list2=[1,2,3,10,11,12,13]
In [3]: list1=set(list1)
In [4]: list1.intersection(list2)
Out[4]: {1, 2, 3}
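Applied to the two files in the question, the same idea could be sketched as below. This assumes tab-separated lines, and common_names is a hypothetical helper name. Note that the intersection only tells you which names are shared; to print the extra file1 columns you would still need a dictionary keyed by name.

```python
def common_names(file1_lines, file2_lines):
    """Return the names found in column 1 of file1 and column 2 of file2."""
    names_in_file1 = set()
    for line in file1_lines:
        fields = line.rstrip('\n').split('\t')
        if fields[0]:
            names_in_file1.add(fields[0])

    names_in_file2 = set()
    for line in file2_lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) >= 2:
            names_in_file2.add(fields[1])

    # Set intersection keeps only the names present in both files.
    return names_in_file1.intersection(names_in_file2)
```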
Upvotes: 0