thomas_chang

Reputation: 97

How to extract a sub part of HTML code by a raw string?

I want to extract a part of some HTML code, namely 'I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p>', from the following code:

<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code.

Because the HTML tags in the code may change, I try to find this part by searching the code for the following raw string (the text without any tags): 'I want to find a substring - abcd1234+'.

To do this, I first use re.escape to escape the special characters in both the whole HTML code above and the raw string. Then I replace every word boundary in the escaped raw string with the regular expression (<.*?>)*, which matches any run of HTML tags. Finally, I search the escaped HTML code with the pattern I just created.

Below is a snippet of my code:

import re

htmlCode = '<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code.'
htmlCodeWithBackSlash = re.escape(htmlCode)

rawString = 'I want to find a substring - abcd1234+'
patternWithBackSlash = re.escape(rawString)
patternWithHTMLTags = re.sub(r'\b', '(<.*?>)*', patternWithBackSlash)

m = re.search(patternWithHTMLTags, htmlCodeWithBackSlash)

if m is not None:
    print(f'm.group() = {m.group()}')
else:
    print('Not matched!')

But the result is "Not matched!". The code above fails to extract the part of the code that I want. Can anyone tell me why this code fails and how to fix it?

Upvotes: 0

Views: 212

Answers (2)

thomas_chang

Reputation: 97

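I found which part of my code went wrong. I should not have escaped the HTML code itself with re.escape. That function is meant for building the pattern; applied to the text being searched, it inserts backslashes that the pattern then fails to match. The r prefix on the string literal below is optional in this case: it only keeps backslashes in a literal from being read as escape sequences, and this HTML contains none.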
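A quick way to see the effect is the minimal sketch below. It uses a shortened sample string, which is an assumption made for brevity and not the full HTML from the question.

```python
import re

# Shortened sample text (an assumption for brevity; not the full HTML above).
text = 'a substring - <span>abcd1234+</span>'
pattern = re.escape('substring - ')

# Searching the original, unescaped text works.
found_in_original = re.search(pattern, text) is not None

# Escaping the text inserts backslashes before characters such as ' ', '-'
# and '+', so the same pattern no longer lines up with it.
escaped_text = re.escape(text)
found_in_escaped = re.search(pattern, escaped_text) is not None

print(found_in_original, found_in_escaped)
```

The first search succeeds and the second one fails, which is the same situation as searching htmlCodeWithBackSlash in the question.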

Below is the revised code which should work for this case:

import re

htmlCode = r'<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code.'

rawString = 'I want to find a substring - abcd1234+'
patternWithBackSlash = re.escape(rawString)
patternWithHTMLTags = re.sub(r'\b', '(<.*?>)*', patternWithBackSlash)

m = re.search(patternWithHTMLTags, htmlCode)

if m is not None:
    print(f'm.group() = {m.group()}')
else:
    print('Not matched!')
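The same idea can be wrapped in a small function. This is only a sketch: the name find_text_ignoring_tags is made up for illustration, and it swaps (<.*?>)* for (?:<[^>]*>)* so that a single tag cannot stretch past its own closing '>'. The sample HTML is shortened for readability.

```python
import re

# Zero or more tags; [^>]* stops each tag at its own closing '>'.
# (A swapped-in variant of the (<.*?>)* pattern used above.)
TAGS = r'(?:<[^>]*>)*'

def find_text_ignoring_tags(raw_string, html_code):
    """Hypothetical helper: find raw_string in html_code, allowing HTML
    tags at every word boundary. Returns the matched HTML or None."""
    escaped = re.escape(raw_string)
    # A function replacement inserts TAGS verbatim, without re.sub
    # interpreting any backslashes in it.
    pattern = re.sub(r'\b', lambda _: TAGS, escaped)
    match = re.search(pattern, html_code)
    return match.group() if match else None

# Shortened sample HTML (an assumption; the style attribute is trimmed).
html_code = ('<p>This is a test. I want to find a substring - '
             '<span style="color: white;">abcd1234+</span></p>')
result = find_text_ignoring_tags('I want to find a substring - abcd1234+', html_code)
print(result)
```

Note that the match ends at 'abcd1234+' and does not include the closing </span></p>, because no word boundary follows the final '+' and so no tag pattern is appended there.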

I still came across other problems related to special characters when I tried to extract parts of the HTML code with regex. I will follow the suggestions in the comments and look for a way to do this with the BeautifulSoup module. If I succeed, I will share my answer here.

Upvotes: 0

Bhargav

Reputation: 4286

Regex is not a great or recommended tool for parsing HTML. I'm adding a second solution without a parser only because that is what you are looking for.

Using bs4:

from bs4 import BeautifulSoup
html = """<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code."""

soup = BeautifulSoup(html, 'html.parser')

all_items = soup.find_all('span')

for item in all_items:
    print(item.text)

#output

abcd1234+
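If installing bs4 is not an option, roughly the same result can be sketched with the standard library's html.parser module. The class name SpanTextCollector is made up for illustration, and the sample HTML is shortened.

```python
from html.parser import HTMLParser

class SpanTextCollector(HTMLParser):
    """Hypothetical helper: collect the text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <span> elements are currently open
        self.texts = []   # one entry per outermost <span>

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.depth += 1
            if self.depth == 1:
                self.texts.append('')

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.depth > 0:
            self.texts[-1] += data

# Shortened sample HTML (an assumption; the style attribute is trimmed).
html = ('<p>This is a test. I want to find a substring - '
        '<span style="font-weight: bold;">abcd1234+</span></p>')
collector = SpanTextCollector()
collector.feed(html)
collector.close()
print(collector.texts)
```

Text inside nested spans is added to the entry of the outermost span, which is a design choice of this sketch and not something bs4 does in the same way.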

Also, using plain string search instead of a parser (this will end up as a disaster if there are multiple tags or pages):

html = """<p>This is a test. I want to find a substring - <span style="color: white; background-color: blue; font-weight: bold;">abcd1234+</span></p> This is the end of the code."""

# Markers: the end of the opening <span ...> tag and the closing tag.
start = '">'
end = '</span>'

# Slice out everything between the first '">' and the last '</span>'.
print(html[html.find(start) + len(start):html.rfind(end)])
print("\n")

#output

abcd1234+
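One weakness of the slicing approach is that str.find and str.rfind return -1 when a marker is missing, and the slice then silently returns the wrong text. A minimal sketch of a guard follows; the function name extract_between is hypothetical and the sample HTML is shortened.

```python
def extract_between(text, start, end):
    """Hypothetical helper: return the text between the first `start`
    and the last `end`, or None if either marker is missing."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.rfind(end)
    if stop == -1 or stop < begin:
        return None
    return text[begin:stop]

# Shortened sample HTML (an assumption; the style attribute is trimmed).
html = ('<p>I want to find a substring - '
        '<span style="font-weight: bold;">abcd1234+</span></p>')
span_text = extract_between(html, '">', '</span>')
missing = extract_between(html, '">', '</div>')
print(span_text)
print(missing)
```

This still breaks as soon as the page contains more than one span, so it is no substitute for a real parser.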

Upvotes: 1
