Nathan Weesie

Reputation: 17

python string splitting with multiple splitting points

OK, I'll get straight to the point. Here is my code:

import re

def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment

The input for this function is a list of strings (seqs), e.g.:

List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]

And the enzymes given as input:

[["TC", 1],["GC",1]]

(Note: more enzymes can be given, but most of them look like this: short strings made up of the letters A, T, C and G.)

The function should return a list that, in this example, contains 2 lists:

Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]

Right now I am having trouble splitting on both enzymes and getting the right output.

A little more information about the function: it searches the string (seq) for the recognition site, in this case TC or GC, and splits at the offset given as the second element of each enzyme entry. It should do that for both strings in the list, with both enzymes.

Upvotes: 0

Views: 483

Answers (6)

PKey

Reputation: 3841

Here is my solution:

Replace TC with "T C" and GC with "G C" (the space is inserted at the given index), then split on whitespace:

def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes: 
            li = li.replace(en[0],en[0][:en[1]]+" " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res
seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))

The results are:

For [["TC", 1],["GC",1]]:

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

For [["AAT", 2],["GC",1]]:

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
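
One caveat that may be worth checking with this approach: str.replace only substitutes non-overlapping occurrences, so when recognition sites overlap, some cut points are skipped. The sketch below is only an illustration. The sequence "CAAAT" and the enzyme ["AA", 1] are made up, the helper names are hypothetical, and it assumes that every overlapping site should be cut, which may or may not be what you want.

```python
import re

def digest_replace(seq, site, cut):
    # Same idea as above: mark the cut inside each site with a space, then split.
    return seq.replace(site, site[:cut] + " " + site[cut:]).split()

def digest_lookaround(seq, site, cut):
    # Zero-width match between site[:cut] and site[cut:]; nothing is consumed,
    # so overlapping sites are all reported.
    pattern = "(?<={})(?={})".format(re.escape(site[:cut]), re.escape(site[cut:]))
    cuts = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:])]

print(digest_replace("CAAAT", "AA", 1))     # ['CA', 'AAT']
print(digest_lookaround("CAAAT", "AA", 1))  # ['CA', 'A', 'AT']
```

For the inputs in the question the sites never overlap, so both variants give the same result there.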

Upvotes: 1

Kenan Banks

Reputation: 211942

Throwing my hat in the ring here.

  • Using a dict for the patterns rather than a list of lists.
  • Joining the patterns with | as others have done, to avoid fancy regexes.

import re

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }

def intervals(patterns, text):
  pattern = '|'.join(patterns.keys())
  start = 0
  for match in re.finditer(pattern, text):
    index = match.start() + patterns.get(match.group())
    yield text[start:index]
    start = index
  # Yield the tail from the last cut; slicing from start (not index)
  # also works when the text contains no match at all.
  yield text[start:]

print([list(intervals(patterns, s)) for s in sequences])

# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

Upvotes: 0

BallpointBen

Reputation: 13750

The simplest answer I can think of:

input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1,len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
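
The loop above relies on every enzyme being two letters long and cutting between them. A possible generalisation is sketched below, assuming enzymes are given as [site, offset] pairs as in the question. The function name digest is made up for this example, and cuts that fall on the very start or end of a string are ignored so that no empty fragments appear.

```python
def digest(sequences, enzymes):
    output = []
    for string in sequences:
        # Collect every absolute cut position contributed by any enzyme.
        cuts = set()
        for site, offset in enzymes:
            pos = string.find(site)
            while pos != -1:
                cuts.add(pos + offset)
                pos = string.find(site, pos + 1)  # step by 1 so overlapping sites count
        # Drop cuts at the string boundaries, then slice between consecutive cuts.
        inner = sorted(c for c in cuts if 0 < c < len(string))
        bounds = [0] + inner + [len(string)]
        output.append([string[a:b] for a, b in zip(bounds, bounds[1:])])
    return output

input_list = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
print(digest(input_list, [["TC", 1], ["GC", 1]]))
# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
```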

Upvotes: 0

Ashwini Chaudhary

Reputation: 250891

Use positive lookbehind and lookahead regex search:

import re


def digest_fragment_with_enzyme(sequences, enzymes):
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start: end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))

Output:

[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'],
 ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
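
To see how the alternation is assembled when the cut offset is not 1, here is a short sketch with the enzymes [["AAT", 2], ["GC", 1]]. The helper name build_pattern is hypothetical, and re.escape is added on the assumption that a recognition site could contain regex metacharacters; for plain A/T/C/G strings it changes nothing.

```python
import re

def build_pattern(enzymes):
    # One zero-width alternative per enzyme:
    # "preceded by site[:cut] and followed by site[cut:]".
    return "|".join(
        "((?<={})(?={}))".format(re.escape(site[:cut]), re.escape(site[cut:]))
        for site, cut in enzymes
    )

pattern = build_pattern([["AAT", 2], ["GC", 1]])
print(pattern)  # ((?<=AA)(?=T))|((?<=G)(?=C))

seq = "AAAGCAAAATCAAAAAAGCAAAAAATC"
indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
print([seq[a:b] for a, b in zip(indices, indices[1:])])
# ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']
```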

Upvotes: 0

Julien Spronck

Reputation: 15423

Here is something that should work using regex. In this solution, I find all occurrences of your enzyme strings and split using their corresponding index.

import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes) # dictionary of enzyme indices

    for seq in seqs:
        sub = []
        pos1 = 0

        enzstr = '|'.join(enz[0] for enz in enzymes) # "TC|GC" in this case
        for match in re.finditer('('+enzstr+')', seq):
            index = dic[match.group(0)]
            pos2 = match.start()+index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
        # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out

Upvotes: 0

pbuck

Reputation: 4551

I'm assuming the idea is to split at each enzyme's recognition site, at the given index within it, so that for a two-letter enzyme with index 1 the split falls between the two letters. You don't need regex for that.

You can do this by looking for the occurrences, inserting a split marker at the correct index, and then post-processing the result to actually split on that marker.

For example:

def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])   # So AATTC becomes AATT|C
        result.append(seq.split('|'))       # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))

Upvotes: 1
