Reputation: 151
I'm working on a project that analyzes PSL files. The program looks at read pairs and identifies circular molecules. It works, but because my loops are nested it is very inefficient: it takes longer than 10 minutes to read through the whole PSL file instead of the ~15 seconds it should.
The relevant code is:
def readPSLpairs(self):
    posread = []
    negread = []
    result = {}
    for psl in self.readPSL():
        parsed = psl.split()
        strand = parsed[9][-1]
        if strand == '1':
            posread.append(parsed)
        elif strand == '2':
            negread.append(parsed)
    for read in posread:
        posname = read[9][:-2]
        poscontig = read[13]
        for read in negread:
            negname = read[9][:-2]
            negcontig = read[13]
            if posname == negname and poscontig == negcontig:
                try:
                    result[poscontig] += 1
                    break
                except:
                    result[poscontig] = 1
                    break
    print(result)
I have tried changing the approach to append the values to lists and then match posname == negname and poscontig == negcontig, but that proved much harder than I expected, so I'm stuck on how to improve its performance.
Upvotes: 0
Views: 77
Reputation: 54193
import collections

# Pre-process all the psl's into all_dict['pos'] or all_dict['neg'].
# This is basically just a `collections.Counter` of what you're doing already!
all_dict = {"pos": collections.defaultdict(int),
            "neg": collections.defaultdict(int)}
result = {}
for psl in self.readPSL():
    parsed = psl.split()
    strand = "pos" if parsed[9][-1] == '1' else "neg"
    name, contig = parsed[9][:-2], parsed[13]
    all_dict[strand][(name, contig)] += 1

# Process all the 'pos' psl's. For every match with a 'neg', set
# result[(name, contig)] to the total (posqty * negqty).
for info, posqty in all_dict['pos'].items():
    negqty = all_dict['neg'][info]  # (defaults to zero)
    result[info] = posqty * negqty
Note that this throws away the whole parsed psl value, keeping only the name and contig slices.
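If you want the same per-contig counts that the nested loops in the question print, a set lookup is one way to get them in a single pass plus one linear scan. This is only a minimal sketch under some assumptions: field 9 is the read name ending in /1 or /2, field 13 is the contig, and lines with fewer than 14 fields can be skipped. The names `count_pairs_per_contig` and `make_line` and the sample reads are made up for illustration and are not part of the original program.

```python
from collections import Counter


def count_pairs_per_contig(lines):
    """Count, per contig, the '/1' reads that have a '/2' mate on the same contig."""
    pos_keys = []     # (name, contig) for every '/1' read, duplicates kept
    neg_keys = set()  # (name, contig) for every '/2' read, for O(1) membership tests
    for line in lines:
        fields = line.split()
        if len(fields) < 14:
            continue  # assumption: blank or truncated lines are ignored
        name, strand = fields[9][:-2], fields[9][-1]
        key = (name, fields[13])
        if strand == '1':
            pos_keys.append(key)
        elif strand == '2':
            neg_keys.add(key)

    result = Counter()
    for name, contig in pos_keys:
        # One increment per '/1' read with at least one mate, mirroring the
        # `break` after the first match in the nested-loop version.
        if (name, contig) in neg_keys:
            result[contig] += 1
    return dict(result)


def make_line(qname, tname):
    """Build a fake 21-column PSL line (hypothetical helper for the demo)."""
    fields = ['0'] * 21
    fields[9], fields[13] = qname, tname
    return '\t'.join(fields)


sample = [
    make_line('readA/1', 'contig1'),
    make_line('readA/2', 'contig1'),
    make_line('readB/1', 'contig1'),
    make_line('readB/2', 'contig2'),  # mate maps to a different contig: no match
    make_line('readC/1', 'contig2'),
    make_line('readC/2', 'contig2'),
]
counts = count_pairs_per_contig(sample)
print(counts)
```

Inside the class you would call it as something like count_pairs_per_contig(self.readPSL()). Each membership test against the set is constant time on average, so the work grows with the number of lines rather than with the product of the two list lengths.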
Upvotes: 1