gmax2017

Reputation: 37

Find substring by using python

I extracted a raw string from a Q&A forum. I have a string like this:

s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'

I want to extract this substring "<font color="blue"><font face="Times New Roman">" and assign it to a new variable. I am able to remove it with regex but I don't know how to assign it to a new variable. I am new to regex.

import re
s1 = re.sub('<.*?>', '', s)

This removes the substring, but I'd like to keep the removed substring for the record, ideally by assigning it to a variable.

How can I do this? I would prefer to use regular expressions.

Upvotes: 0

Views: 84

Answers (2)

saurabh baid

Reputation: 1877

bs4 is more appropriate for web scraping, but if you are okay with regex for your case, you could do the following:

>>> import re
>>> s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'
>>> regex = re.compile('<.*?>')
>>> regex.findall(s)
['<font color="blue">', '<font face="Times New Roman">', '<font color="green">', '<font face="Arial">']
>>> regex.sub('', s)
'Take about 2 + but double check with teacher before you do'
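Since findall returns each tag separately, here is a minimal sketch of one way to get the two adjacent tags as a single string and to keep a record of everything that was removed. The names tag_run, first_run, removed and record_and_drop are made up for illustration, and the pattern assumes no tag contains a literal ">" inside an attribute value.

```python
import re

s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">'
     'before you do')

# One or more tags sitting back to back, matched as a single run.
tag_run = re.compile(r'(?:<[^>]*>)+')

# First run of adjacent tags, kept in its own variable.
match = tag_run.search(s)
first_run = match.group(0) if match else ''
print(first_run)  # <font color="blue"><font face="Times New Roman">

# Remove every run while recording what was removed.
removed = []

def record_and_drop(m):
    removed.append(m.group(0))  # keep the matched text for the record
    return ''                   # and drop it from the result

cleaned = tag_run.sub(record_and_drop, s)
print(cleaned)  # Take about 2 + but double check with teacher before you do
```

re.sub accepts a function as the replacement, so removal and bookkeeping happen in one pass; removed ends up holding both runs in the order they appeared.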

Upvotes: 1

rednafi

Reputation: 1731

Regex is not exactly the easiest tool to parse HTML components. You can try using BeautifulSoup to parse the components and make your substring.

from bs4 import BeautifulSoup

s = """Take about 2 + <font color="blue">
       <font face="Times New Roman">but double check with teacher <font color="green">
       <font face="Arial">before you do"""


soup = BeautifulSoup(s, "html.parser")

Print the parsed HTML with print(soup.prettify()):

Take about 2 +
<font color="blue">
 <font face="Times New Roman">
  but double check with teacher
  <font color="green">
   <font face="Arial">
    before you do
   </font>
  </font>
 </font>
</font>

Extract components:

soup.font.font['face']
> 'Times New Roman'
soup.font["color"]
> 'blue'

Now make and save your substring as a variable:

variable = f'<font color="{soup.font["color"]}"><font face="{soup.font.font["face"]}">'

This will give you:

'<font color="blue"><font face="Times New Roman">'
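If installing bs4 is not an option, a similar result can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the approach above: it collects the raw text of each opening tag as the parser meets it, so the original quoting is preserved without rebuilding the string by hand. The class name StartTagCollector and the variable first_two are hypothetical.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Collects the raw source text of every opening tag, in order."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written in the input.
        self.start_tags.append(self.get_starttag_text())


s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">'
     'before you do')

collector = StartTagCollector()
collector.feed(s)
collector.close()

# Join the first two opening tags into the substring from the question.
first_two = "".join(collector.start_tags[:2])
print(first_two)  # <font color="blue"><font face="Times New Roman">
```

Taking the first two entries relies on knowing that those tags are adjacent in the input; the parser itself does not record whether text sat between them.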

Upvotes: 0
