ADJ

Reputation: 5282

How to return everything in a string that is not matched by a regex?

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.

import re

string = """<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email"""

pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

re.findall(pattern, string)

['<font size="2px" face="Tahoma">',
 '<br>',
 '</font>',
 '<div>',
 '<br>',
 '</div>',
 '<div>']

desired_string = "Good Morning,&nbsp;As per last email"

Upvotes: 0

Views: 40

Answers (2)

Bryan Oakley

Reputation: 385980

Instead of re.findall, use re.sub to replace each match with an empty string.

re.sub(pattern, "", string)

While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
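As a minimal sketch of the re.sub approach, here it is applied to the string and pattern from the question. The final html.unescape step is an addition that is not part of the original answer; it is included only on the assumption that you also want the &nbsp; entity decoded.

```python
import re
from html import unescape

# Triple quotes let the HTML keep its own double quotes.
string = """<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email"""

pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# Replace every match with an empty string; what remains is the unmatched text.
remaining = re.sub(pattern, "", string)
print(remaining)  # Good Morning,&nbsp;As per last email

# Optional (assumption, not in the original answer): decode entities such as &nbsp;.
# The result contains a non-breaking space (U+00A0), not an ordinary space.
decoded = unescape(remaining)
```

Note that this only works as far as the pattern covers the tags in your input; a tag containing a character outside the class (for example a single quote or a comma) would be left in place.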

Upvotes: 3

Andy

Reputation: 50560

Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.

from bs4 import BeautifulSoup

string="""<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email"""

soup = BeautifulSoup(string, 'lxml')

print(soup.get_text())

This outputs:

Good Morning, As per last email

One thing to notice is that this method decodes the &nbsp; entity into a non-breaking space character (U+00A0). It prints like a regular space, but it does not compare equal to one.
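If installing BeautifulSoup and lxml is not an option, a rough equivalent can be sketched with the standard library's html.parser module. This is a swapped-in technique, not what the answer above uses, and the names TextExtractor and strip_tags are hypothetical helpers made up for this illustration. The last line shows one way to turn the non-breaking space into an ordinary space, assuming that is what you want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; may be called more than once per text run.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


string = """<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font><div><br></div><div>As per last email"""

text = strip_tags(string)

# &nbsp; was decoded to U+00A0; replace it if a plain space is wanted.
normalized = text.replace("\xa0", " ")
print(normalized)  # Good Morning, As per last email
```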

Upvotes: 1
