user3224522

Reputation: 1151

Intersecting regions of two files and printing combined result

I have two large files. I want to find the names that appear both in column 1 of file1 and in column 2 of file2. The script below finds them, but I also want to print the corresponding data from file1 in the output, and that part does not work. How can I fix it?

file1.txt

GRMZM5G888627_P01   GO:0003674  molecular_function
GRMZM5G888620_P01   GO:0008150  biological_process
GRMZM5G888625_P03   GO:0008152  metabolic process

file2.txt

contig1 GRMZM5G888627_P01
contig2 AT2G41790.1
contig3 GRMZM5G888625_P03

Desired output:

contig1 GRMZM5G888627_P01  GO:0003674   molecular_function
contig3 GRMZM5G888625_P03  GO:0008152   metabolic process

Script:

f1=open('file1.txt','r')
f2=open('file2.txt','r')
output = open('result.txt','w')

dictA= dict() 
for line1 in f1:
   listA = line1.rstrip('\n').split('\t')
   dictA[listA[0]] = listA

for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
       output.writelines(query+'\t'+subject+'\t'+str(listA[1])+str(listA[2])+'\n')
output.close()

Upvotes: 1

Views: 72

Answers (3)

sundar nataraj

Reputation: 8692

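Try this:

annotations = {}
with open('file1.txt') as ins:
    for line in ins:
        # The files are tab-separated; splitting on tabs keeps "metabolic process" in one field.
        arrayline = line.rstrip('\n').split('\t')
        annotations[arrayline[0]] = '\t'.join(arrayline)

with open('file2.txt') as file2, open('result.txt', 'w') as output:
    for line in file2:
        array2 = line.rstrip('\n').split('\t')
        # Write only the lines whose name also appears in file1.
        if len(array2) > 1 and array2[1] in annotations:
            output.write(array2[0] + '\t' + annotations[array2[1]] + '\n')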

Upvotes: 1

Joe

Reputation: 2437

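Inside the loop

for line1 in f2:

listA is not tied to the current line of file2. It still holds whatever the last line of file1 left in it, so every match prints the same data. The per-name lists are stored in dictA.

Once you have tested that subject is in dictA, retrieve the matching list from it: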

for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    query=new_list[0]
    subject=new_list[1]
    new_list.append(query)
    new_list.append(subject)
    if subject in dictA:
        listA = dictA[subject]
        output.writelines(query+'\t'+subject+'\t'+str(listA[1])+str(listA[2])+'\n')
output.close()
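To see why the original script printed the same data for every match, here is a minimal sketch that uses the three rows of file1.txt from the question as an in-memory list (the `rows`, `stale` and `looked_up` names are only for illustration). After the first loop finishes, `listA` is still bound to the last row, while `dictA` holds one entry per name.

```python
# The three rows of file1.txt from the question, already split on tabs.
rows = [
    ["GRMZM5G888627_P01", "GO:0003674", "molecular_function"],
    ["GRMZM5G888620_P01", "GO:0008150", "biological_process"],
    ["GRMZM5G888625_P03", "GO:0008152", "metabolic process"],
]

dictA = {}
for listA in rows:
    dictA[listA[0]] = listA

# After the loop, listA still refers to the last row only.
stale = listA[1]

# Looking the name up in dictA returns the row that belongs to it.
looked_up = dictA["GRMZM5G888627_P01"][1]

print(stale, looked_up)
```

The first value printed is the GO ID of the last row of file1, regardless of which name was matched; the second is the GO ID that belongs to the requested name.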

I don't understand why you are appending to new_list in here:

    query=new_list[0]
    subject=new_list[1]
    new_list.append(query)
    new_list.append(subject)

When processing the first line, you read in:

contig1 GRMZM5G888627_P01

Into new_list, giving you essentially:

new_list == ['contig1', 'GRMZM5G888627_P01']

Then you set query and subject to the two items in the list. Then append them back onto it, giving you:

new_list == ['contig1', 'GRMZM5G888627_P01', 'contig1', 'GRMZM5G888627_P01']

Which you never use. You should be able to just have:

for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    subject=new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        output.writelines(new_list[0] + '\t' + subject + '\t' + str(listA[1]) + str(listA[2]) + '\n')
output.close()

Also, you are writing only one line at a time, so output.write is enough. Building the string by repeated concatenation is hard to read, so I replaced it with format. listA already holds strings, so I dropped the str() calls.

for line1 in f2:
    new_list=line1.rstrip('\n').split('\t')
    subject=new_list[1]
    if subject in dictA:
        listA = dictA[subject]
        # Tab between the GO ID and its description, as in the desired output.
        output.write("{}\t{}\t{}\t{}\n".format(new_list[0], subject, listA[1], listA[2]))
output.close()
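Putting the pieces together, the whole task could be sketched as one function. This is only a sketch under some assumptions: both files are tab-separated as in the question, names in file1 are unique, and the function name `join_on_name` and its parameters are made up for illustration.

```python
def join_on_name(annot_path, contig_path, out_path):
    """Write contig, name and file1 data for every name found in both files.

    Returns the number of lines written.
    """
    # Map each name in column 1 of file1 to the remaining columns of its row.
    annotations = {}
    with open(annot_path) as annot_file:
        for line in annot_file:
            if not line.strip():
                continue  # skip blank lines
            fields = line.rstrip("\n").split("\t")
            annotations[fields[0]] = fields[1:]

    matches = 0
    with open(contig_path) as contig_file, open(out_path, "w") as out_file:
        for line in contig_file:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            contig, name = fields[0], fields[1]
            if name in annotations:
                out_file.write("\t".join([contig, name] + annotations[name]) + "\n")
                matches += 1
    return matches
```

It would be called as join_on_name('file1.txt', 'file2.txt', 'result.txt'). The with blocks close all three files even if an error occurs, which the original script did only for the output file.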

Upvotes: 2

Padraic Cunningham

Reputation: 180401

Use sets:

In [1]: list1=[1,2,3,4,5,6,7,8,9]

In [2]: list2=[1,2,3,10,11,12,13]

In [3]: list1=set(list1)

In [4]: list1.intersection(list2)
Out[4]: {1, 2, 3}
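Applied to the question's data, the same idea might look like the sketch below. The rows are the sample lines from the question held in memory, and the helper name `common_names` is only illustrative.

```python
def common_names(file1_rows, file2_rows):
    """Return the names found in column 1 of file1 and column 2 of file2."""
    names1 = {row[0] for row in file1_rows}
    names2 = {row[1] for row in file2_rows}
    return names1 & names2  # same as names1.intersection(names2)


file1_rows = [
    ["GRMZM5G888627_P01", "GO:0003674", "molecular_function"],
    ["GRMZM5G888620_P01", "GO:0008150", "biological_process"],
    ["GRMZM5G888625_P03", "GO:0008152", "metabolic process"],
]
file2_rows = [
    ["contig1", "GRMZM5G888627_P01"],
    ["contig2", "AT2G41790.1"],
    ["contig3", "GRMZM5G888625_P03"],
]

shared = common_names(file1_rows, file2_rows)
print(sorted(shared))
```

Note that the intersection yields only the shared names. To print the contig and the GO data next to each name, a lookup keyed by name (such as a dict) is still needed.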

Upvotes: 0
