How to find a pattern among several different lines using Python

Question

I have three lines like the ones below. I want to print found if at least 5 of the last 15 characters in the upper line have a | character directly beneath them (meaning those characters bind to the characters in the third line).

5'      TCAGATGTGTATAAGAGACAGTGCGTATTCTCAGTCAGTTGAAGTGATACAGAA
            ::     :::       ::        : :     : ||||| :          
3'          ATTCAGCCTGCACTCGTTACCGAGGCATGACAGAGAATTCACGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG

So far I have this script, which can find specific patterns, but I have not been able to extend it to do what I described above. It would be great if someone could help me extend the code below. Thank you.

import re

in_str = '''
5'      TCAGATGTGTATAAGAGACAGTGCGTATTCTCAGTCAGTTGAAGTGATACAGAA
            ::     :::       ::        : :     : ||||| :          
3'          ATTCAGCCTGCACTCGTTACCGAGGCATGACAGAGAATTCACGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG'''

in_str = re.sub(r'^\s*', "", in_str)
lst = re.split(r'\n', in_str)

acgt = set(['A', 'C', 'G', 'T'])
for idx in range(min([len(s) for s in lst])):
    if lst[0][idx] in acgt and lst[1][idx] == '|' and lst[2][idx] in acgt:
        print('found!')
        break

Based on the structure of the three lines, the script should print found, since 5 of the last 15 characters of the upper string have | beneath them.

Roy · Accepted Answer

How about something like this? First slice the second line based on the length of the first line, then count the occurrences of | in that slice.

in_str = '''
5'      TCAGATGTGTATAAGAGACAGTGCGTATTCTCAGTCAGTTGAAGTGATACAGAA
            ::     :::       ::        : :     : ||||| :          
3'          ATTCAGCCTGCACTCGTTACCGAGGCATGACAGAGAATTCACGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG'''

lines = in_str.strip().split("\n")
first_line_length = len(lines[0]) # 62
valid_char_start_index = max(first_line_length - 15, 0)  # 47, so the slice covers exactly the last 15 characters

required_slice = lines[1][valid_char_start_index:first_line_length]  # : ||||| :   #
occurrence_of_pipe_in_slice = required_slice.count("|")  # 5
if occurrence_of_pipe_in_slice >= 5:
    print("found")

In case the mapping is at the beginning, as mentioned in the comment, we can do something like this:

import re

in_str = '''
5'     TCAGATGTGTATAAGAGACAGGTGTAATCGTTCCGCTTGAATGTACGTCATGAA
            |||||   ::  :    ::    ::  :       :       ::      
3' ATTTGCAGTACACTCGTTACCGAGGCATGACAGAGAATATGTGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG'''

lines = in_str.strip().split("\n")


first_line_prefix = re.match(r"\d\'\s+", lines[0]).group()  # "5'     "
third_line_prefix = re.match(r"\d\'\s+",lines[2]).group()  # "3' "

if len(first_line_prefix) < len(third_line_prefix):  # whichever line starts first, use its start index
    valid_char_start_index = len(first_line_prefix)
else:
    valid_char_start_index = len(third_line_prefix)
# valid_char_start_index = 3

# Slicing never raises IndexError: if the second line is shorter than the
# requested slice, Python simply stops at the end of the line.
required_slice = lines[1][valid_char_start_index:valid_char_start_index + 15]  # '         ||||| '
occurrence_of_pipe_in_slice = required_slice.count("|")  # 5
if occurrence_of_pipe_in_slice >= 5:
    print("found")

To summarise, we can create a function that takes a parameter saying whether to look for the mapping at the beginning or at the end, and then determines the start and end indices accordingly. From there, the rest of the process is the same in both cases.
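A minimal sketch of such a function, assuming the input is always the three-line layout shown above and that lines 1 and 3 start with a strand label like 5' or 3'. The names count_pipes and has_enough_pipes and the parameters where, window and min_pipes are made up for illustration; they are not from the snippets above.

```python
import re

# Hypothetical helper names and parameters, chosen for illustration only.
LABEL = re.compile(r"\d'\s+")  # strand label such as "5'   " or "3' "


def count_pipes(alignment, where="end", window=15):
    """Count '|' characters of the second line inside the chosen window."""
    lines = alignment.strip("\n").splitlines()
    top, middle, bottom = lines[0], lines[1], lines[2]

    if where == "end":
        # Window covers the last `window` characters of the upper line.
        end = len(top.rstrip())
        start = max(end - window, 0)
    elif where == "start":
        # Window begins where the earlier-starting strand begins.
        prefixes = [LABEL.match(line) for line in (top, bottom)]
        if any(m is None for m in prefixes):
            raise ValueError("lines 1 and 3 must start with a label like 5' or 3'")
        start = min(len(m.group()) for m in prefixes)
        end = start + window
    else:
        raise ValueError("where must be 'start' or 'end'")

    # A slice that runs past the end of the string is truncated, not an error.
    return middle[start:end].count("|")


def has_enough_pipes(alignment, where="end", window=15, min_pipes=5):
    """Return True if the window contains at least `min_pipes` '|' characters."""
    return count_pipes(alignment, where, window) >= min_pipes
```

With this in place, the two cases above would reduce to something like `if has_enough_pipes(in_str, where="end"): print("found")` and the same call with `where="start"`.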
