Reputation: 17
OK, I'll get straight to the point. Here is my code:
def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment
The input to this function is a list of strings (seqs), e.g.:
List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
And the enzymes are given as:
[["TC", 1],["GC",1]]
(Note: several enzymes can be given, but most of them follow this pattern of letters drawn from A, T, C and G.)
The function should return a list that, in this example, contains two lists:
Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]
Right now I am having trouble splitting on both enzymes and getting the right output.
A little more information about the function: it looks through each string (seq) for the recognition site, in this case TC or GC, and splits the string at the offset given as the second element of the enzyme entry. It should do that for both strings in the list, with both enzymes.
Upvotes: 0
Views: 483
Reputation: 3841
Here is my solution:
Replace TC with T C, and GC with G C (the space is inserted at the given index), and then split on the space character.
def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes:
            # insert a space inside each recognition site at the cut index
            li = li.replace(en[0], en[0][:en[1]] + " " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res

seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))
The results are:
For [["TC", 1],["GC",1]]:
['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
For [["AAT", 2],["GC",1]]:
['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
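One caveat with the replace-and-split idea, shown here as a minimal sketch: each `replace` runs on the string already modified for the previous enzyme, so a recognition site that spans an earlier cut is no longer found. The helper name `digest_one` and the overlapping enzyme pair `TC`/`TCG` are made up for illustration and do not come from the question.

```python
def digest_one(seq, enzymes):
    """Replace-and-split digest of a single sequence (same idea as digest above)."""
    for site, offset in enzymes:
        # Insert a space at the cut offset inside every occurrence of the site.
        seq = seq.replace(site, site[:offset] + " " + site[offset:])
    return seq.split()

# Sites from the question: they do not overlap, so the result is as expected.
print(digest_one("AATTCCGGTCGGGGCTCGGGGG", [["TC", 1], ["GC", 1]]))

# Hypothetical overlapping sites: once "TC" has become "T C", the text "TCG"
# no longer occurs, so the cut between C and G is silently skipped.
print(digest_one("ATCGA", [["TC", 1], ["TCG", 2]]))
```

For the enzymes in the question this does not matter, since TC and GC cannot overlap in a way that hides a site.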
Upvotes: 1
Reputation: 211942
Throwing my hat in the ring here.
import re

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }

def intervals(patterns, text):
    pattern = '|'.join(patterns.keys())
    start = 0
    for match in re.finditer(pattern, text):
        # cut position = start of the matched site + that site's offset
        index = match.start() + patterns.get(match.group())
        yield text[start:index]
        start = index
    # use start here: index is undefined when the text contains no site
    yield text[start:]

print([list(intervals(patterns, s)) for s in sequences])
# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
Upvotes: 0
Reputation: 13750
The simplest answer I can think of:
input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1,len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
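The loop above hard-codes two-letter sites that are cut after the first letter. One possible generalization, as a sketch only: test every position with `str.startswith` and take the cut offset from the question's `[site, offset]` pairs. The function name `digest_scan` is hypothetical.

```python
def digest_scan(seq, enzymes):
    """Split seq at every cut position implied by the [site, offset] pairs."""
    cuts = set()
    for i in range(len(seq)):
        for site, offset in enzymes:
            # startswith(site, i) checks for the site at position i without slicing.
            if seq.startswith(site, i):
                cuts.add(i + offset)
    # Ignore cuts at the very ends so that no empty fragments are produced.
    bounds = [0] + sorted(c for c in cuts if 0 < c < len(seq)) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

input_list = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print([digest_scan(s, enzymes) for s in input_list])
```

Because every position is checked independently, overlapping sites each contribute their own cut, at the cost of a nested loop over positions and enzymes.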
Upvotes: 0
Reputation: 250891
Use positive lookbehind and lookahead regex search:
import re

def digest_fragment_with_enzyme(sequences, enzymes):
    # one zero-width alternative per enzyme: lookbehind for the part before
    # the cut, lookahead for the part after it
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start: end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))
Output:
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'],
['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
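On Python 3.7 or later, where `re.split` accepts patterns that can match the empty string, the same kind of zero-width pattern can be handed to `re.split` directly. This is a sketch under that version assumption: `digest_split` is a hypothetical name, `re.escape` is added so that sites are treated literally, the groups are dropped so `re.split` does not return them, and each cut index is assumed to fall strictly inside its site.

```python
import re

def digest_split(sequences, enzymes):
    """Split each sequence at zero-width cut points (needs Python 3.7+)."""
    # No capturing groups here: re.split would otherwise include them in the result.
    pattern = "|".join(
        "(?<={})(?={})".format(re.escape(site[:cut]), re.escape(site[cut:]))
        for site, cut in enzymes
    )
    return [re.split(pattern, seq) for seq in sequences]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(digest_split(seq, enzymes))
```

This avoids collecting the match indices by hand, but it ties the code to newer Python versions, which the generator above does not.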
Upvotes: 0
Reputation: 15423
Here is something that should work using regex. In this solution, I find all occurrences of your enzyme strings and split using their corresponding index.
import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes) # dictionary of enzyme indices
    for seq in seqs:
        sub = []
        pos1 = 0
        enzstr = '|'.join(enz[0] for enz in enzymes) # "TC|GC" in this case
        for match in re.finditer('('+enzstr+')', seq):
            index = dic[match.group(0)]
            pos2 = match.start()+index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
    # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out
Upvotes: 0
Reputation: 4551
I'm assuming the idea is to split at each enzyme's recognition site, at the given index within the site, so that the split in essence falls between two letters. You don't need regex for that.
You can do this by looking for the occurrences, inserting a split indicator at the correct index, and then post-processing the result to actually split.
For example:
def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1]) # So AATTC becomes AATT|C
        result.append(seq.split('|')) # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))
Upvotes: 1