vk673

Reputation: 23

regex to match words of length specified within string

I am trying to parse the text output from samtools mpileup. I start with a string

s = '.$......+2AG.+2AG.+2AGGG'

Whenever I have a + followed by an integer n, I would like to select n characters following that integer and replace the whole thing by *. So for this test case I would have

'.$......+2AG.+2AG.+2AGGG' ---> '.$......*.*.*GG' 

I tried the regex \+[0-9]+[ACGTNacgtn]+, but it consumes every base after the integer, so the output is .$......*.*.* and the trailing G's are lost. How do I select n characters when n is not known ahead of time but is specified in the string itself?
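
A minimal reproduction of the attempt above, assuming Python's re module and a plain '*' replacement string (the exact call is not shown in the question, so that part is an assumption):

```python
import re

s = '.$......+2AG.+2AG.+2AGGG'

# The character class is greedy, so it swallows every base after the count,
# including the two G's that should survive after '+2AG'.
result = re.sub(r'\+[0-9]+[ACGTNacgtn]+', '*', s)
print(result)  # .$......*.*.*
```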

Upvotes: 2

Views: 106

Answers (2)

fferri

Reputation: 18940

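The repl argument in re.sub can be a string or a function.

When it is a function, it is called with each match object and its return value is used as the replacement, so the replacement can be computed from the matched text: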

import re

def removechars(m):
    x = m.group()                           # whole match, e.g. '+2AGGG'
    n = re.match(r'\+(\d+)', x).group(1)    # digit part, as a string
    return '*' + x[1 + len(n) + int(n):]    # skip '+', the digits and n bases

This solves your problem:

>>> re.sub(r'\+[0-9]+[ACGTNacgtn]+', removechars, s)
'.$......*.*.*GG'
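
A variant of the same idea, as a sketch: capturing the count and the bases in separate groups means the callback does not need a second re.match. The function name collapse_insertions is made up for illustration.

```python
import re

def collapse_insertions(pileup):
    """Replace each '+n' plus its n inserted bases with '*'.

    Bases beyond the first n belong to the following reads and are kept.
    (collapse_insertions is a hypothetical helper name.)
    """
    def repl(m):
        n = int(m.group(1))      # declared insertion length
        bases = m.group(2)       # greedy run of bases after the count
        return '*' + bases[n:]   # drop only the first n bases
    return re.sub(r'\+(\d+)([ACGTNacgtn]+)', repl, pileup)

print(collapse_insertions('.$......+2AG.+2AG.+2AGGG'))  # .$......*.*.*GG
```

Because the count is read with \d+, this should also cope with multi-digit lengths such as +12.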

Upvotes: 1

Daniel Marasco

Reputation: 96

Not the most elegant, but I pulled out the numeric values using re.findall before running re.sub.

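import re

# Collect each distinct count, then replace '+n' followed by exactly n characters.
# \d+ handles multi-digit counts; (?!\d) keeps n=1 from matching inside '+12'.
for i in set(re.findall(r'\+(\d+)', s)):
    s = re.sub(r'\+%s(?!\d)\w{%s}' % (i, i), '*', s)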

Upvotes: 0
