Reputation: 43

extract substring pattern

I have a long file containing about 1200 sequences, like these:

>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP


>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL

I want to extract every possible pattern that has a cysteine (C) in the middle, with five characters before it and five characters after it, such as xxxxxCxxxxx.

The output should look like this:

QDIQLCGMGIL
ILPEHCIIDIT
TISDNCVVIFS
FSKTSCSYCTM

This is my program. It only gives the position of C and does not do what I want:

pos=[]

def find(ch,string1):

    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos



z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')

print z

Upvotes: 2

Answers (2)

Padraic Cunningham

Reputation: 180441

You need to return outside the loop. You are returning on the first match, so you only ever get a single index in your list:

def find(ch,string1):  
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos # outside

You can also use enumerate with a list comp in place of your range logic:

def indexes(ch, s1):  
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]

Each index in the list comp is the character index and each char is the actual character so we keep each index where char is equal to ch.

If you want the five chars on both sides:

In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"

In [25]: inds = indexes("C",s)

In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']

I added a check on the index because we obviously cannot get five chars before C if the index is < 5, and the same applies at the end of the string.

You can do it all in a single function, yielding a slice when you find a match:

def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i- 5:i + 6]

Presuming the data in your question is actually two lines from your file, like:

s = """>3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""

Running:

for line in s.splitlines():
    print(list(find("C" ,line)))

would output:

['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']

That gives six matches, not four as your expected output suggests, so I presume you did not include all possible matches.
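Note that the first match above ('0JLQ2CFLVNL') contains characters from the header line. If the file really has the layout shown in the question, with a header line starting with > followed by the sequence wrapped over several lines, you could join each record's lines before searching. This is only a rough sketch under that assumption; read_sequences, windows and flank are names made up for the illustration:

```python
def read_sequences(lines):
    """Yield (header, sequence) pairs from FASTA-style lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # wrapped sequence line
    if header is not None:
        yield header, "".join(chunks)


def windows(ch, seq, flank=5):
    """Yield each slice with `flank` chars on both sides of `ch`."""
    last = len(seq) - flank - 1
    for i, char in enumerate(seq):
        if char == ch and flank <= i <= last:
            yield seq[i - flank:i + flank + 1]


data = """>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
"""

for header, seq in read_sequences(data.splitlines()):
    print(header, list(windows("C", seq)))
```

With the two records from the question this finds five windows: the header characters no longer leak into a match, and the C at the start of the first sequence is skipped because nothing precedes it.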

You can also speed up the code using str.find, starting at the last match index + 1 for each subsequent match:

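def find(ch, s):
    # Start at index 5: an earlier match has fewer than five chars before it
    ln, i = len(s) - 6, s.find(ch, 5)
    while 5 <= i <= ln:  # stops when find returns -1 or the tail is too short
        yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)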

This gives the same output. Of course, if the matches cannot overlap, you can start looking for the next match much further along the string each time.
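To make the non-overlapping case concrete, here is a rough sketch. It assumes "cannot overlap" means two reported slices may not share any character, so after a match at index i the next usable match is at i + 11 or later. find_non_overlapping is a made-up name, not something from your code:

```python
def find_non_overlapping(ch, s):
    last = len(s) - 6
    i = s.find(ch, 5)  # a match needs five chars before it
    while 5 <= i <= last:
        yield s[i - 5:i + 6]
        # The current slice ends at i + 5, so the next slice can start
        # at i + 6 at the earliest, which puts its centre at i + 11.
        i = s.find(ch, i + 11)


print(list(find_non_overlapping("C", "abdefCghijCklmnoCpqrst")))
# ['abdefCghijC', 'klmnoCpqrst']
```

The C at index 10 is never reported as a centre because its slice would overlap the first one.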

Upvotes: 2

Adib

Reputation: 1334

My solution is based on regex, and shows all possible solutions using regex and while loop. Thanks to @Smac89 for improving it by transforming it into a generator:

import re

string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP

LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""

# Generator
def find_cysteine2(string):

    # Create a loop that will utilize regex multiple times
    # in order to capture matches within groups
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})',string)

        # If match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)

            # Shrink the string to not include 
            # the previous result
            location = data.start() + 1
            string = string[location:]

        # If there are no matches, stop the loop
        else:
            break

print(list(find_cysteine2(string)))
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
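As a possible alternative to re-slicing the string in a loop, a zero-width lookahead lets the regex engine report overlapping matches in a single pass. This is just a sketch using the same \w{5}C\w{5} pattern; find_cysteine_lookahead is a name I made up for the example:

```python
import re


def find_cysteine_lookahead(text):
    # (?=...) consumes no characters, so the scan moves on one position
    # at a time and overlapping windows are all reported.
    # Group 1 holds the 11-character window itself.
    return [m.group(1) for m in re.finditer(r'(?=(\w{5}C\w{5}))', text)]


print(find_cysteine_lookahead("ETISDNCVVIFSKTSCSYCTMAKKL"))
# ['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
```

Keep in mind that \w also matches digits and underscores, so header lines should be removed first if they are part of the input.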

Upvotes: 1
