Reputation: 37
I extracted a raw string from a Q&A forum. I have a string like this:
s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'
I want to extract the substring '<font color="blue"><font face="Times New Roman">' and assign it to a new variable. I am able to remove it with a regex, but I don't know how to capture it in a variable. I am new to regex.
import re
s1 = re.sub('<.*?>', '', s)
This removes the substring, but I'd like to keep the removed text for the record, ideally by assigning it to a variable.
How can I do this? I would prefer regular expressions.
Upvotes: 0
Views: 84
Reputation: 1877
bs4 is more appropriate for web scraping, but if you are okay with regex for your case, you could do the following:
>>> import re
>>> s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'
>>> regex = re.compile('<.*?>')
>>> regex.findall(s)
['<font color="blue">', '<font face="Times New Roman">', '<font color="green">', '<font face="Arial">']
>>> regex.sub('', s)
'Take about 2 + but double check with teacher before you do'
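Since findall returns every tag separately, here is a minimal sketch of one way to keep only the first run of back-to-back tags in a variable. The names tag_run, removed, and cleaned are my own, and the pattern assumes no attribute value contains a literal '>'.

```python
import re

s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'

# One or more tags placed back to back, with nothing between them.
# Assumption: no attribute value contains a literal '>'.
tag_run = re.compile(r'(?:<[^>]*>)+')

# search() returns the first match, or None if the string has no tags.
match = tag_run.search(s)
removed = match.group(0) if match else ''

# Strip every run of tags to get the plain text.
cleaned = tag_run.sub('', s)

print(removed)
print(cleaned)
```

Here removed holds the first pair of font tags as one string, and cleaned is the text with all tags stripped.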
Upvotes: 1
Reputation: 1731
Regex is not exactly the easiest tool for parsing HTML. You can try using BeautifulSoup to parse the components and build your substring.
from bs4 import BeautifulSoup
s = """Take about 2 + <font color="blue">
<font face="Times New Roman">but double check with teacher <font color="green">
<font face="Arial">before you do"""
soup = BeautifulSoup(s, "html.parser")
Print the HTML (for example with print(soup.prettify())):
Take about 2 +
<font color="blue">
<font face="Times New Roman">
but double check with teacher
<font color="green">
<font face="Arial">
before you do
</font>
</font>
</font>
</font>
Extract components:
soup.font.font['face']
> 'Times New Roman'
soup.font["color"]
> 'blue'
Now make and save your substring as a variable:
variable = f'<font color="{soup.font["color"]}"><font face="{soup.font.font["face"]}">'
This will give you:
"<font color="blue"><font face="Times New Roman">"
Upvotes: 0
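If you want the tags exactly as they were written rather than rebuilt from their attributes, a parser can also hand back the raw text. This is only a sketch using the standard library's html.parser instead of BeautifulSoup. StartTagCollector and first_two are hypothetical names of mine, and get_starttag_text() is the HTMLParser method that returns the text of the most recently opened start tag.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records the raw text of every opening tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() gives the tag as it appeared in the input.
        self.start_tags.append(self.get_starttag_text())


s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'

collector = StartTagCollector()
collector.feed(s)
collector.close()

# Join the first two opening tags into the substring from the question.
first_two = "".join(collector.start_tags[:2])
print(first_two)
```

This keeps the original quoting and spacing of each tag, which the f-string approach has to recreate by hand.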