Tarun Kumar

Reputation: 875

Replacing a regex match with the length of matched text

How can I replace patterns like <html> and </mainbody> with <4> and </8>? Here 4 and 8 are the number of letters inside the angle brackets. The input is read from a file.

import re

def main():
    # Regular expression matching simple tags such as <html> and </html>
    pattern = re.compile(r"</?[a-zA-Z]+>")
    with open("input.txt") as fh:
        for line in fh:
            # "***" is a placeholder; the goal is to insert the letter count instead
            print(pattern.sub("***", line.strip()))


if __name__ == "__main__":
    main()

Upvotes: 0

Views: 80

Answers (1)

Burhan Khalid

Reputation: 174662

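Pass a function as the replacement argument to re.sub. It is called with each match object and returns the replacement string, so it can return the length of the match: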

import re

def get_length(obj):
    s = obj.group(1)  # tag name, with a leading "/" for closing tags
    return '</{}>'.format(len(s) - 1) if s.startswith('/') else '<{}>'.format(len(s))

>>> re.sub("<(/?[a-zA-Z]+)>", get_length, '<html>')
'<4>'
>>> re.sub("<(/?[a-zA-Z]+)>", get_length, '</html>')
'</4>'
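
To tie this back to the line-by-line loop in the question, here is a minimal sketch. The names `TAG`, `tag_length` and `replace_tags_in_lines` are hypothetical, and the pattern captures the slash and the tag name in separate groups, which is a small variation on the function above rather than the same code. Any iterable of strings works as input, so an open file handle could be passed in place of the sample list.

```python
import re

# Group 1 is the optional slash, group 2 is the tag name (letters only).
TAG = re.compile(r"<(/?)([a-zA-Z]+)>")

def tag_length(match):
    """Return the tag with its name replaced by the name's length."""
    slash, name = match.groups()
    return "<{}{}>".format(slash, len(name))

def replace_tags_in_lines(lines):
    """Yield each stripped line with simple tags replaced by their lengths."""
    for line in lines:
        yield TAG.sub(tag_length, line.strip())

# Sample input standing in for the lines of input.txt.
sample = ["<html>\n", "<mainbody>text</mainbody>\n", "</html>\n"]
for result in replace_tags_in_lines(sample):
    print(result)
# prints:
# <4>
# <8>text</8>
# </4>
```

Capturing the slash separately avoids the startswith check, since the captured slash (or empty string) can be written straight back into the output.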

Note that your regular expression is very basic: it will not match tags that have attributes, so those tags are left unchanged.
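
A quick way to see this limitation is to test the pattern against a few tags. This is only an illustrative sketch; `is_simple_tag` is a hypothetical helper, and the sample tags are made up for the demonstration.

```python
import re

# Same pattern as in the answer: letters only between the angle brackets.
pattern = re.compile(r"<(/?[a-zA-Z]+)>")

def is_simple_tag(text):
    """Return True if the pattern finds a match anywhere in text."""
    return pattern.search(text) is not None

print(is_simple_tag("<html>"))                   # True
print(is_simple_tag('<a href="index.html">'))    # False: the attribute breaks the match
print(is_simple_tag("<h1>"))                     # False: digits are not in [a-zA-Z]
```

Because re.sub only replaces what the pattern matches, such tags pass through untouched rather than raising an error.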

Upvotes: 1
