Reputation: 23
I am trying to parse the text output from samtools mpileup. I start with a string
s = '.$......+2AG.+2AG.+2AGGG'
Whenever I have a + followed by an integer n, I would like to select the n characters following that integer and replace the whole thing with *. So for this test case I would have
'.$......+2AG.+2AG.+2AGGG' ---> '.$......*.*.*GG'
I have the regex \+[0-9]+[ACGTNacgtn]+ but that results in the output .$......*.*.* so the trailing G's are lost too. How do I select n characters when n is not known ahead of time but is specified in the string itself?
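A minimal reproduction of the problem, using only the string and pattern given above. The variable name greedy is just for illustration; it shows that the character class consumes every base after the count rather than stopping at n.

```python
import re

s = '.$......+2AG.+2AG.+2AGGG'

# [ACGTNacgtn]+ is greedy: it swallows all bases after the count,
# not only the two that "+2" announces.
greedy = re.sub(r'\+[0-9]+[ACGTNacgtn]+', '*', s)
print(greedy)  # .$......*.*.*
```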
Upvotes: 2
Views: 106
Reputation: 18940
The repl argument of re.sub can be a string or a function, so you can do very complex replacements with a function:
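import re

def removechars(m):
    x = m.group()                         # whole match, e.g. '+2AGGG'
    n = re.match(r'\+(\d+)', x).group(1)  # digit part, as a string
    # skip the '+', the digits and the n inserted bases; keep the rest
    return '*' + x[1 + len(n) + int(n):]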
Solves your problem:
>>> re.sub(r'\+[0-9]+[ACGTNacgtn]+', removechars, s)
'.$......*.*.*GG'
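A variant of the same idea, sketched with capture groups so the count does not have to be re-parsed inside the callback. The names INSERTION and collapse_insertion are hypothetical, chosen for this example only.

```python
import re

# Group 1 is the announced length, group 2 the greedy run of bases after it.
INSERTION = re.compile(r'\+(\d+)([ACGTNacgtn]+)')

def collapse_insertion(m):
    n = int(m.group(1))     # announced insertion length
    bases = m.group(2)      # everything the greedy class matched
    return '*' + bases[n:]  # put back whatever follows the first n bases

s = '.$......+2AG.+2AG.+2AGGG'
result = INSERTION.sub(collapse_insertion, s)
print(result)  # .$......*.*.*GG
```

Because the count is matched with \d+, multi-digit lengths such as +10 are handled the same way.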
Upvotes: 1
Reputation: 96
Not the most elegant, but I pulled out the numeric values using re.findall before running re.sub.
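import re
ls = set(re.findall(r'\+(\d+)', s))  # distinct insertion lengths
for i in ls:
    # match '+n' followed by exactly n bases
    s = re.sub(r'\+%s[ACGTNacgtn]{%s}' % (i, i), '*', s)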
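For comparison, the same replacement can be sketched as a single left-to-right scan with no regex at all. The function name mask_insertions is hypothetical, and the sketch assumes that every + followed by digits marks an insertion and that the n characters after the count are the inserted bases (it does not validate them).

```python
def mask_insertions(pileup):
    """Replace each '+<n><n characters>' with '*' in one pass."""
    out = []
    i = 0
    while i < len(pileup):
        if pileup[i] == '+':
            # read the run of ASCII digits that follows the '+'
            j = i + 1
            while j < len(pileup) and pileup[j] in '0123456789':
                j += 1
            if j > i + 1:  # at least one digit was found
                n = int(pileup[i + 1:j])
                out.append('*')
                i = j + n  # jump past the count and the n inserted characters
                continue
        out.append(pileup[i])
        i += 1
    return ''.join(out)

print(mask_insertions('.$......+2AG.+2AG.+2AGGG'))  # .$......*.*.*GG
```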
Upvotes: 0