Reputation: 97
I want to extract this part of some HTML code:
I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p>
from the following code:
<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code.
Because the HTML tags in the code may change, I try to locate this part by searching the code for the following raw string: 'I want to find a substring - abcd1234+'.
To do this, I first use the escape function in the re module to escape special characters in both the whole HTML code above and the raw string. Then I replace each word boundary in the escaped raw string with the regular expression pattern (<.*?>)* so that it can match any HTML tags. Finally, I search the HTML code with the regular expression pattern I just created.
Below is the snapshot of my code:
import re

htmlCode = '<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code.'
# Escape the text that will be searched
htmlCodeWithBackSlash = re.escape(htmlCode)
rawString = 'I want to find a substring - abcd1234+'
# Escape the raw string, then allow optional tags at every word boundary
patternWithBackSlash = re.escape(rawString)
patternWithHTMLTags = re.sub(r'\b', '(<.*?>)*', patternWithBackSlash)
m = re.search(patternWithHTMLTags, htmlCodeWithBackSlash)
if m is not None:
    print(f'm.group() = {m.group()}')
else:
    print('Not matched!')
But the result is "Not matched!". The code above fails to extract the part of the code that I want. Can anyone tell me why this code fails and how to fix it?
Upvotes: 0
Views: 212
Reputation: 97
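I found which part of my code went wrong. I should not have passed the HTML code through the escape function in the re module. That function is meant for patterns, and applying it to the text being searched inserts literal backslashes that the pattern then fails to match. Instead, I leave the HTML code unescaped and write it with the r prefix, which keeps any backslashes in the literal as they are.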
Below is the revised code which should work for this case:
import re

htmlCode = r'<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code.'
rawString = 'I want to find a substring - abcd1234+'
patternWithBackSlash = re.escape(rawString)
patternWithHTMLTags = re.sub(r'\b', '(<.*?>)*', patternWithBackSlash)
m = re.search(patternWithHTMLTags, htmlCode)
if m is not None:
    print(f'm.group() = {m.group()}')
else:
    print('Not matched!')
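To make the difference concrete, here is a minimal sketch (assuming Python 3.7 or later) that runs one tag-tolerant pattern against both the escaped and the unescaped HTML. The helper name build_tag_tolerant_pattern is hypothetical, and the sketch swaps (<.*?>)* for the slightly stricter (?:<[^>]*>)*, so treat it as an illustration rather than the exact code above.

```python
import re

html_code = '<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code.'
raw_string = 'I want to find a substring - abcd1234+'


def build_tag_tolerant_pattern(raw):
    """Hypothetical helper: escape the raw text, then allow any number of
    HTML tags at every word boundary of the escaped text."""
    escaped = re.escape(raw)
    # Assumption: tags contain no '>' inside attribute values.
    return re.sub(r'\b', '(?:<[^>]*>)*', escaped)


pattern = build_tag_tolerant_pattern(raw_string)

# Escaping the searched text turns ' ' into '\ ', so the pattern
# (which expects a plain space after 'I') cannot match any more.
print(re.search(pattern, re.escape(html_code)))

# Searching the unescaped text works.
match = re.search(pattern, html_code)
print(match.group())
```

Only the pattern side goes through re.escape; the text being searched is passed in unchanged.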
I still ran into other problems related to special characters when I tried to extract parts of the HTML code with regex. I will follow the suggestions in the comments and try to do this with the BeautifulSoup module. If I succeed, I will share my answer here.
Upvotes: 0
Reputation: 4286
Regex is not a great or recommended tool for parsing HTML. I'm also adding a solution with regex, since that is what you are looking for.
Using bs4:
from bs4 import BeautifulSoup

html = """<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code."""
soup = BeautifulSoup(html, 'html.parser')
all_items = soup.find_all('span')
for item in all_items:
    print(item.text)
# Output:
# abcd1234+
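If installing bs4 is not an option, a similar idea can be sketched with the standard library's html.parser: collect only the text between tags, then look for the raw string in it. The names TextCollector and visible_text are made up for this example. With bs4 the analogous call would be soup.get_text().

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that keeps only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return ''.join(collector.parts)


html = """<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code."""
raw_string = 'I want to find a substring - abcd1234+'

text = visible_text(html)
print(text)
print(raw_string in text)
```

This only tells you that the raw string is present once the tags are stripped. It does not map the match back to a position in the original markup.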
Also, using regex (strictly speaking, the snippet below relies on str.find and str.rfind rather than the re module).
This will end in disaster if there are multiple tags and pages.
html = """<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code."""
start = '">'
end = '</span>'
# Slice between the first '">' and the last '</span>'
print(html[html.find(start) + len(start):html.rfind(end)])
# Output:
# abcd1234+
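For comparison, a version that actually goes through the re module could look like the sketch below. It assumes the spans are not nested and that no attribute value contains a '>' character. The name span_pattern is just for illustration.

```python
import re

html = """<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code."""

# Non-greedy capture of the text inside each <span ...>...</span> pair.
# Assumption: spans are not nested.
span_pattern = re.compile(r'<span\b[^>]*>(.*?)</span>', re.DOTALL)

contents = span_pattern.findall(html)
print(contents)
```

Because the capture is non-greedy, several spans on one page come back as separate items instead of one long slice, but the earlier warning about regex and HTML still applies.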
Upvotes: 1