Reputation: 575
Before I get into my question, I would like to provide you guys with what I have thus far.
First, I generated a nested dictionary from a file that I would like to use for comparison. An example of what my dictionary looks like is shown below (the only difference being the size):
Negdic = {'ADA': {'NM_000022': ['43248162', '43248939',
                                '43249658', '43251228',
                                '43251469', '43251647',
                                '43252842', '43254209',
                                '43255096', '43257687',
                                '43264867', '43280215', '']},
          'ALDOB': {'NM_000035': ['104182841', '104187124',
                                  '104187734', '104188836',
                                  '104189763', '104190750',
                                  '104192036', '104193057',
                                  '104197990', '']}}
Now this is where I'm struggling, since I'm unfamiliar with Python and new to programming. I would like to use a second file to search through my dictionary for matches. My file looks like this:
chrom exon_start exon_end strand isoform exon_numer gene coding_length total_mutations_reported total_exonic_mutations exonic_splicing_mutations total_splice_site_mutations 3_ss_mutations 5_ss_mutations
chr20 43255096 43255240 - NM_000022 4 ADA 144 12 9 0 3 3 0
chr9 104187734 104187909 - NM_000035 7 ALDOB 175 7 4 0 3 2 1
What I want to do is search through my dictionary for the gene name, then match the isoform name, and then lastly search through the corresponding isoform list for the exon_start and print the position in the list where the exon_start was found.
Here is some example code that I've been trying to work with, but I'm not sure if I'm on the right track.
for line in open("NegativeHotspot.txt"):
    columns = line.split('\t')
    if len(columns) >= 2:
        Hotspotgenes = columns[6]
        Hotspotgenes2 = Hotspotgenes.split()
        print Hotspotgenes2
        #print Hotspotgenes2
        #x = type(Hotspotgenes)
        #print x
#for k in Hotspotgenes:
#    if k in Negdic:
#        print k, Negdic[k]
The first part is something I've been trying to mess with to create a list of the genes in the file to search the dictionary for my results, but I'm struggling to even create a list from my output of columns[6]. Plus, I'm not even sure if I'm tackling my code in the best possible way. The last part of that coding section was something I was just messing with in an attempt to find a match in my dictionary.
Help would be greatly appreciated. I'm so lost :(
Upvotes: 1
Views: 105
Reputation: 1649
I will try to put you on the right track and point out a couple of things that could be useful for you in the future. When opening a file you are better off using the "with" statement, as this will close the file for you when you're done. So do something like:
import csv

with open('eggs.csv', 'rb') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in spamreader:
        print ', '.join(row)
What csv.reader gives you back is an iterator (it behaves like a Python generator). Without going into details, bear in mind that you can iterate through such an object only once, so it isn't like iterating through a list or a dictionary. If you want to search for something else, you might need to read the whole file again. To solve this, you could save your data into a more useful object, such as a list of lists, where each row is a list and the whole file is a list of those rows.
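A minimal sketch of the "list of lists" idea, written for Python 3 (the question's print statements suggest Python 2, so adjust as needed). It uses an in-memory io.StringIO in place of a real file; the sample text and variable names are made up for illustration:

```python
import csv
import io

# Stand-in for an open file; the contents are invented for this example.
sample = io.StringIO("a b c\nd e f\n")

reader = csv.reader(sample, delimiter=' ')
rows = list(reader)          # consume the reader once, keeping every row

first_pass = [row[0] for row in rows]
second_pass = [row[0] for row in rows]   # a list can be iterated again

leftover = list(reader)      # the reader itself is now exhausted
```

Here leftover comes back empty because the reader was already used up, while rows can be looped over as many times as you like.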
Then you could use the header row to turn each list into a dictionary that you can index by column name. So if I have a CSV like:
fruits, vegetables, cars
banana, cucumber, audi
An option would be to have a list of dictionaries, so each row would look like: {'fruits': 'banana', 'vegetables': 'cucumber', ...}. This is easier to index but perhaps not as compact as the list of lists. In the end, I would recommend bearing in mind how each object performs in Big O terms, because it will make a difference if your data set is large.
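One possible way to build that list of dictionaries is csv.DictReader, sketched here for Python 3 with the small CSV above held in an io.StringIO rather than a file. The skipinitialspace option is my assumption about how to handle the blank after each comma:

```python
import csv
import io

# In-memory stand-in for the small CSV shown above.
sample = io.StringIO("fruits, vegetables, cars\nbanana, cucumber, audi\n")

# skipinitialspace=True drops the blank that follows each comma,
# so the keys are 'vegetables' and 'cars' rather than ' vegetables' and ' cars'.
reader = csv.DictReader(sample, skipinitialspace=True)
records = [dict(row) for row in reader]

print(records[0]['fruits'])
```

Each element of records is one row keyed by the header names, so records[0]['cars'] reads more clearly than a numeric column index.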
The problem with dictionaries is that they are great for looking things up by key, but if you want to search for banana in the example I showed you, it won't be efficient. You would have to iterate through the whole data set looking for the dict that has banana as a value.
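To make the cost concrete, here is a rough Python 3 sketch comparing a linear scan with a reverse index built once up front. The reverse index is my own suggestion rather than something from the question, and the second record is invented for illustration:

```python
from collections import defaultdict

records = [
    {'fruits': 'banana', 'vegetables': 'cucumber', 'cars': 'audi'},
    {'fruits': 'apple', 'vegetables': 'carrot', 'cars': 'audi'},  # made-up row
]

# Linear scan: touches every record on every search.
matches_scan = [i for i, rec in enumerate(records) if 'banana' in rec.values()]

# Reverse index: built once, then each lookup is a single dict access.
index = defaultdict(list)
for i, rec in enumerate(records):
    for value in rec.values():
        index[value].append(i)

matches_index = index['banana']
```

The trade-off is extra memory and one pass over the data to build the index, which only pays off if you search by value more than a handful of times.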
Upvotes: 1
Reputation: 54273
You have a tab-separated value file, so you should use the module dedicated to delimited file formats, csv.
import csv
You also have headers with meaningful names. It'd be way easier to understand doing row[header_name] than row[col_number], so let's use csv.DictReader:
with open("NegativeHotspot.txt") as f:
    reader = csv.DictReader(f, delimiter="\t")
Now we can iterate through each row of reader and pull out the info you need using the list.index method:
    # still inside the "with" block, so the file is open while we read
    for row in reader:
        gene, isoform = row['gene'], row['isoform']
        count = Negdic[gene][isoform].index(row['exon_start'])
You never say what your end result is for the count variable, but count is now the (zero-based) index where exon_start occurs in your Negdic[gene][isoform] list. Note that list.index raises a ValueError if the value isn't there.
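Putting the pieces together, a self-contained sketch for Python 3 might look like the following. It uses trimmed copies of the question's dictionary and file (held in an io.StringIO), plus one invented row to show a miss. The function name find_exon_positions and the choice to record None for a miss are my assumptions, not something the question asked for:

```python
import csv
import io

# Trimmed copies of the question's dictionary and file (fewer entries and columns).
negdic = {
    'ADA': {'NM_000022': ['43252842', '43254209', '43255096', '43257687', '']},
    'ALDOB': {'NM_000035': ['104182841', '104187124', '104187734', '']},
}
hotspot_file = io.StringIO(
    "chrom\texon_start\texon_end\tstrand\tisoform\texon_numer\tgene\n"
    "chr20\t43255096\t43255240\t-\tNM_000022\t4\tADA\n"
    "chr9\t104187734\t104187909\t-\tNM_000035\t7\tALDOB\n"
    "chr1\t999\t1000\t+\tNM_999999\t1\tFAKE\n"   # made-up row with no match
)

def find_exon_positions(handle, lookup):
    """Return (gene, isoform, exon_start, index) tuples; index is None on a miss."""
    results = []
    for row in csv.DictReader(handle, delimiter="\t"):
        gene, isoform, start = row['gene'], row['isoform'], row['exon_start']
        try:
            position = lookup[gene][isoform].index(start)
        except (KeyError, ValueError):
            # KeyError: unknown gene or isoform; ValueError: start not in the list.
            position = None
        results.append((gene, isoform, start, position))
    return results

positions = find_exon_positions(hotspot_file, negdic)
for gene, isoform, start, position in positions:
    print(gene, isoform, start, position)
```

With a real file you would pass the handle from with open("NegativeHotspot.txt") as f: instead of the StringIO object.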
Upvotes: 1