Guerin Claire

Reputation: 1

Find common items in two text files

To introduce you to the context of my problem: I have two files containing information about genes:

pos.bed contains the positions of specific genes, and hg19-genes.txt contains all the known genes of the species, with attributes such as each gene's position (start and end), its name, its symbol, etc.

The problem is that in pos, only the position of the gene is indicated, but not its name/symbol. I would like to read through both files and compare the start and end in each line. If there is a match, I would like to get the symbol of the corresponding gene.

I wrote this short Python code:

pos=open('C:/Users/Claire/Desktop/Arithmetics/pos.bed','r')
gen=open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt','r')

for row in pos:
    row=row.split()
    start=row[11]
    end=row[12]
    for row2 in gen:
        row2=row2.split()
        start2=row2[3]
        end2=row2[4]
        sym=row2[10]
        if start==start2 and end==end2:
            print sym

pos.close()
gen.close()

But it seems like this only compares the two files line by line (for example, line 2 of pos with line 2 of gen only). So I tried adding an else to the if statement, but I get an error message:

    else:
        gen.next()

StopIteration                             Traceback (most recent call last)
<ipython-input-9-a309fdca7035> in <module>()
     14             print sym
     15         else:
---> 16             gen.next()
     17 
     18 pos.close()

StopIteration:

I know it is possible to compare all the lines of 2 files, no matter the position of the line, by doing something like:

same = set(file1).intersection(file2)

but in my case I only want to compare some columns of each line as the lines have different information in each file (except for the start and the end). Is there a similar way to compare lines in files, but only for some specified items?

Upvotes: 0

Views: 50

Answers (1)

tommi

Reputation: 114

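gen is an iterator over the lines of the file, and it can be consumed only once. The inner loop reads it to the end while processing the first row of pos, so for every later row there is nothing left to iterate over. The extra gen.next() call in the else branch advances the same iterator, and it raises StopIteration once the file has been read to the end. The simplest workaround is to open the gen file inside the outer loop: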

pos=open('C:/Users/Claire/Desktop/Arithmetics/pos.bed','r')

for row in pos:
    row=row.split()
    start=row[11]
    end=row[12]
    gen=open('C:/Users/Claire/Desktop/Arithmetics/hg19-genes.txt','r')
    for row2 in gen:
        row2=row2.split()
        start2=row2[3]
        end2=row2[4]
        sym=row2[10]
        if start==start2 and end==end2:
            print sym
    gen.close()

pos.close()
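
To see why the original inner loop ran only for the first row, here is a small sketch that uses a plain list iterator as a stand-in for the open file. The sample lines are made up; an open file behaves the same way when you loop over it.

```python
# Stand-in for an open file: an iterator over three made-up lines.
lines = iter(["a\n", "b\n", "c\n"])

first_pass = [line.strip() for line in lines]   # consumes the iterator
second_pass = [line.strip() for line in lines]  # nothing is left to read

print(first_pass)
print(second_pass)
```

The second pass comes back empty, which is what happens to the inner loop over gen for every row of pos after the first one.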

Another option would be to read all lines of gen into a list and use that list in the inner loop.
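
A minimal sketch of that second option, assuming the column indices from the question (11 and 12 in pos, 3, 4 and 10 in gen). The function names load_gene_rows and find_symbols are hypothetical, chosen only for this example. Both functions accept any iterable of lines, so an open file works as well as a list of strings.

```python
def load_gene_rows(gen_lines):
    """Split every non-empty line of the gene file once and keep the rows in a list."""
    return [line.split() for line in gen_lines if line.strip()]


def find_symbols(pos_lines, gene_rows):
    """Return the symbol of every gene row whose start and end match a pos row."""
    symbols = []
    for line in pos_lines:
        row = line.split()
        if not row:
            continue  # skip blank lines
        start, end = row[11], row[12]
        # Unlike a file iterator, a list can be looped over again for each pos row.
        for row2 in gene_rows:
            if start == row2[3] and end == row2[4]:
                symbols.append(row2[10])
    return symbols
```

With the real files you would open hg19-genes.txt once, pass the file object to load_gene_rows, and then pass the open pos.bed file and the resulting list to find_symbols. If the gene file is large, a dictionary keyed on the (start, end) pair would avoid the inner loop entirely; that is a variation on this idea, not something the approach above requires.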

Upvotes: 1
