Reputation: 704
I am trying to do some interval range operations on two files with folowing conditions check if chrom is equal, then check if my start and end of of co0rdinatefile is within or equal to start and end of gene_annotation file (condition if the strand is "+" the start and end will be for eg 10-20, if its "-" it will be 20-10) , if match print start end strand from coordinate and gene_id, gene_name from geneannotation file. (for representation purpose I have head annoataion file )
number of rows in annotation file ~50000 number of rows in coordinated file ~200,000
gene_annotationfile
chrom start end gene_id gene_name strand
17 71223692 71274336 ENSMUSG00000085299 Gm16627 -
17 18186448 18211184 ENSMUSG00000067978 Vmn2r-ps113 +
11 84645863 84684319 ENSMUSG00000020530 Ggnbp2 -
7 51097639 51106551 ENSMUSG00000074155 Klk5 +
13 31711037 31712238 ENSMUSG00000087276 Gm11378 +
coordinates_file
chrom start end strand
1 4247322 4247912 -
1 4427449 4432604 +
1 4763414 4764404 -
1 4764597 4767606 -
1 4764597 4766491 -
1 4766882 4767606 -
1 4767729 4772649 -
1 4767729 4768829 -
1 4767729 4775654 -
1 4772382 4772649 -
1 4772814 4774032 -
1 4772814 4774159 -
1 4772814 4775654 -
1 4772814 4774032 +
1 4774186 4775654 -
1 4774186 4775654
1 4774186 4775699 -
desired output
chrom, start, end,strand, gene_id, gene_name
1 4427432 4432686 + ENSMUSG0001 abcd
Another problem is in some cases if there is a match it may map to to gene_id in that case i would like to write
chrom, start, end,strand, gene_id, gene_name
1 4427432 4432686 + ENSMUSG0001,ENSMUSG0002 abcd,efgh
my code so far:
import csv
with open('coordinates.txt', 'r') as source:
coordinates = list(csv.reader(source, delimiter="\t"))
with open('/gene_annotations.txt', 'rU') as source:
#if i do not use 'rU' i get this error Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
annotations = list(csv.reader(source, delimiter="\t"))
for index,line in enumerate(coordinates):
for index2, line2 in enumerate(annotations):
if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
break
error that I get
---> 15 if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
16 print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
17 break
TypeError: list indices must be integers, not list
will pandas be a good approach for this?
Upvotes: 0
Views: 742
Reputation: 3555
There are tools dedicated this this such as bedtools
intersect;
https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
And Bedops intersect;
https://bedops.readthedocs.io/en/latest/content/reference/set-operations/bedops.html
Upvotes: 0
Reputation: 26
I assume coordinates is a list of lists, like [[1,2],[3,4]] . The line
for index,line in enumerate(coordinates):
iterates over coordinates, returning each line of coordinates as line and the index as index.
if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
The error message means that you are using a list (line) for the index here. You likely wanted to use index instead of line:
if coordinates[index][0] == annotations[index2][0] and coordinates[index][1] <= annotations[index2][1] and annotations[index2][2] >= coordinates[index][2] :
Even better would be to just use line:
if line[0] == line2[0] and line[1] <= line2[1] and line2[2] >= line[2] :
see https://docs.python.org/2.7/reference/compound_stmts.html?highlight=for_stmt#grammar-token-for_stmt
Upvotes: 1