Reputation: 1096
I have three lines like the ones below. I want to print found if at least 5 of the last 15 characters in the upper line have a | character directly beneath them (meaning these characters bind to the characters in the third line):
5' TCAGATGTGTATAAGAGACAGTGCGTATTCTCAGTCAGTTGAAGTGATACAGAA
:: ::: :: : : : ||||| :
3' ATTCAGCCTGCACTCGTTACCGAGGCATGACAGAGAATTCACGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG
So far I have this script, which can find specific patterns, but I have not been able to extend it to do what I described above. It would be great if someone could help me extend the code below. Thank you.
import re
in_str = '''
5' TCAGATGTGTATAAGAGACAGTGCGTATTCTCAGTCAGTTGAAGTGATACAGAA
:: ::: :: : : : ||||| :
3' ATTCAGCCTGCACTCGTTACCGAGGCATGACAGAGAATTCACGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG'''
in_str = re.sub(r'^\s*', "", in_str)
lst = re.split(r'\n', in_str)
acgt = set(['A', 'C', 'G', 'T'])
for idx in range(min([len(s) for s in lst])):
    if lst[0][idx] in acgt and lst[1][idx] == '|' and lst[2][idx] in acgt:
        print('found!')
        break
Given the structure of the three lines, the script should output found, since 5 of the last 15 characters of the upper string have a | beneath them.
Upvotes: 0
Views: 199
Reputation: 344
How about something like this? First slice the second line to the columns under the last 15 characters of the first line, then count the occurrences of | in that slice.
in_str = '''
5' TCAGATGTGTATAAGAGACAGTGCGTATTCTCAGTCAGTTGAAGTGATACAGAA
:: ::: :: : : : ||||| :
3' ATTCAGCCTGCACTCGTTACCGAGGCATGACAGAGAATTCACGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG'''
lines = in_str.strip().split("\n")
first_line_length = len(lines[0]) # 62
# start of the last 15 columns of the first line (0 if the line is shorter than 15)
valid_char_start_index = max(first_line_length - 15, 0) # 47
required_slice = lines[1][valid_char_start_index:first_line_length] # part of line 2 under those 15 characters
occurrence_of_pipe_in_slice = required_slice.count("|") # 5
if occurrence_of_pipe_in_slice >= 5:
    print("found")
If the binding is at the beginning instead, as mentioned in the comment, we can do something like this:
import re
in_str = '''
5' TCAGATGTGTATAAGAGACAGGTGTAATCGTTCCGCTTGAATGTACGTCATGAA
||||| :: : :: :: : : ::
3' ATTTGCAGTACACTCGTTACCGAGGCATGACAGAGAATATGTGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG'''
lines = in_str.strip().split("\n")
first_line_prefix = re.match(r"\d\'\s+", lines[0]).group() # "5' "
third_line_prefix = re.match(r"\d\'\s+",lines[2]).group() # "3' "
# use whichever strand starts first as the start column
valid_char_start_index = min(len(first_line_prefix), len(third_line_prefix)) # 3
# slicing never raises IndexError: if the second line is shorter than the
# requested slice, the slice simply stops at the end of the line
required_slice = lines[1][valid_char_start_index:valid_char_start_index+15]
occurrence_of_pipe_in_slice = required_slice.count("|") # 5
if occurrence_of_pipe_in_slice >= 5:
    print("found")
To summarise, we can write a function that takes a parameter saying whether to look for the binding at the beginning or at the end, and determines the start and end indices accordingly. From there, the rest of the process is the same in both cases.
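A minimal sketch of such a function might look like the following. The names count_pipes and is_found, the where/window/min_pipes parameters, and the small test alignment are all made up for illustration. As a simplification of the prefix comparison above, the "start" case measures the window from where the upper strand begins, which is an assumption about what is wanted.

```python
import re

def count_pipes(lines, where="end", window=15):
    """Count '|' in the middle line under one end of the upper strand.

    Hypothetical helper: `where` is "start" or "end", `window` is the
    number of columns to inspect.
    """
    top, match_line = lines[0], lines[1]
    if where == "end":
        stop = len(top)
        start = max(stop - window, 0)
    elif where == "start":
        # skip the "5' " style label so the window starts at the first base
        prefix = re.match(r"\d'\s+", top)
        start = prefix.end() if prefix else 0
        stop = start + window
    else:
        raise ValueError("where must be 'start' or 'end'")
    # slicing clips silently if the middle line is shorter than `stop`
    return match_line[start:stop].count("|")

def is_found(lines, where="end", window=15, min_pipes=5):
    return count_pipes(lines, where, window) >= min_pipes

# Made-up alignment, built with explicit padding so the columns are unambiguous
top = "5' " + "ACGT" * 6                    # 27 columns
match_line = " " * 15 + "|||||" + " " * 7   # pipes in columns 15..19
bottom = "3' " + "TGCA" * 7
lines = [top, match_line, bottom]

print(is_found(lines, where="end"))    # True: all 5 pipes fall in columns 12..26
print(is_found(lines, where="start"))  # False: only 3 pipes fall in columns 3..17
```

Keeping the counting in one helper means the two cases differ only in how start and stop are computed.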
Upvotes: 1
Reputation: 44213
You can use the collections.Counter class to help with counting:
from collections import Counter
in_str = '''5' TCAGATGTGTATAAGAGACAGTGCGTATTCTCAGTCAGTTGAAGTGATACAGAA
:: ::: :: : : : ||||| :
3' ATTCAGCCTGCACTCGTTACCGAGGCATGACAGAGAATTCACGTAGAGGCGAGCTAAGGTACTTGAAAGGGTGTATTAGAG'''
# split string into lines:
strs = in_str.split('\n')
line1 = strs[0]
line2 = strs[1]
length1 = len(line1)
assert length1 >= 15 # we assume this
# last 15 characters of the first line
suffix1 = line1[length1-15:length1] # or line1[-15:]
# the 15 characters of the second line that are under the above characters
suffix2 = line2[length1-15:length1]
# count occurrences of each character:
c = Counter(suffix2)
found = c.get('|', 0) >= 5
print(found)
Prints:
True
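If you also want to keep the check from the question's own loop (that each | actually sits between a base in the upper line and a base in the third line), a rough sketch could combine the two ideas. The function name count_bound_bases and the sample alignment below are hypothetical, and the sketch assumes all three lines share the same column layout.

```python
def count_bound_bases(lines, window=15):
    """Count columns in the last `window` columns of the upper line where
    a '|' sits between a base above and a base below (hypothetical helper)."""
    top, match_line, bottom = lines
    acgt = set("ACGT")
    start = max(len(top) - window, 0)
    # never index past the end of the shortest line
    stop = min(len(top), len(match_line), len(bottom))
    return sum(
        1
        for i in range(start, stop)
        if top[i] in acgt and match_line[i] == "|" and bottom[i] in acgt
    )

# Made-up alignment, built with explicit padding so the columns are unambiguous
top = "5' " + "ACGT" * 6           # 27 columns
match_line = " " * 15 + "|||||"    # pipes in columns 15..19
bottom = "3' " + "TGCA" * 7

found = count_bound_bases([top, match_line, bottom]) >= 5
print(found)  # True

# With a third line that ends before the pipes, nothing counts as bound
short_bottom = "3' " + "TGCA" * 3  # 15 columns
print(count_bound_bases([top, match_line, short_bottom]))  # 0
```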
Upvotes: 2