find overlaps/range between two datasets using pandas or enumerate

Question

I am trying to do some interval range operations on two files with folowing conditions check if chrom is equal, then check if my start and end of of co0rdinatefile is within or equal to start and end of gene_annotation file (condition if the strand is "+" the start and end will be for eg 10-20, if its "-" it will be 20-10) , if match print start end strand from coordinate and gene_id, gene_name from geneannotation file. (for representation purpose I have head annoataion file )

number of rows in annotation file ~50000 number of rows in coordinated file ~200,000

gene_annotationfile

chrom     start       end             gene_id    gene_name strand
17  71223692  71274336  ENSMUSG00000085299      Gm16627      -
17  18186448  18211184  ENSMUSG00000067978  Vmn2r-ps113      +
11  84645863  84684319  ENSMUSG00000020530       Ggnbp2      -
 7  51097639  51106551  ENSMUSG00000074155         Klk5      +
13  31711037  31712238  ENSMUSG00000087276      Gm11378      +

coordinates_file

  chrom start   end strand
  1 4247322 4247912 -
  1 4427449 4432604 +
  1 4763414 4764404 -
  1 4764597 4767606 -
  1 4764597 4766491 -
  1 4766882 4767606 -
  1 4767729 4772649 -
  1 4767729 4768829 -
  1 4767729 4775654 -
  1 4772382 4772649 -
  1 4772814 4774032 -
  1 4772814 4774159 -
  1 4772814 4775654 -
  1 4772814 4774032 +
  1 4774186 4775654 -
  1 4774186 4775654 
  1 4774186 4775699 -

desired output

 chrom, start, end,strand, gene_id, gene_name
 1      4427432 4432686 + ENSMUSG0001 abcd

Another problem is in some cases if there is a match it may map to to gene_id in that case i would like to write

 chrom, start, end,strand, gene_id, gene_name
 1      4427432 4432686 + ENSMUSG0001,ENSMUSG0002 abcd,efgh

my code so far:

 import csv 

 with open('coordinates.txt', 'r') as source:
      coordinates = list(csv.reader(source, delimiter="	"))

 with open('/gene_annotations.txt', 'rU') as source:
      #if i do not use 'rU' i get this error Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
      annotations = list(csv.reader(source, delimiter="	"))

for index,line in enumerate(coordinates):

    for index2, line2 in enumerate(annotations):


        if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
         print "%s	%s	%s	%s	%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
         break

error that I get

---> 15         if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
16              print "%s	%s	%s	%s	%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
17              break

TypeError: list indices must be integers, not list

will pandas be a good approach for this?

tobbes76 · Accepted Answer

I assume coordinates is a list of lists, like [[1,2],[3,4]] . The line

for index,line in enumerate(coordinates):

iterates over coordinates, returning each line of coordinates as line and the index as index.

if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :

The error message means that you are using a list (line) for the index here. You likely wanted to use index instead of line:

if coordinates[index][0] == annotations[index2][0] and coordinates[index][1] <= annotations[index2][1] and annotations[index2][2] >= coordinates[index][2] :

Even better would be to just use line:

if line[0] == line2[0] and line[1] <= line2[1] and line2[2] >= line[2] :

see https://docs.python.org/2.7/reference/compound_stmts.html?highlight=for_stmt#grammar-token-for_stmt

find overlaps/range between two datasets using pandas or enumerate

Answers (2)

Related Questions