Reputation: 532
I am trying to match HTML text that has been converted into a string, but none of my regexes work.
The HTML text I am trying to match against:
"[<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]"
The sentences that I want to match are:
CLASS 8B PHY | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM
and the other similar entries in the HTML above.
The code I am using to match these doesn't seem to work:
import re
html_text = '''[<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI) | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]'''
regex = re.compile(r'^[CLASS]*[M]')
match = regex.findall(str(html_text))
print(match)
I think the regex I am providing is not the right one for this match.
Upvotes: 0
Views: 59
Reputation: 627087
You are dealing with HTML, so it makes sense to use BeautifulSoup to parse HTML in Python.
from bs4 import BeautifulSoup
s = """Your HTML goes here"""  # `s` is the HTML string used to initialize `doc`
doc = BeautifulSoup(s, 'html.parser')
for span in doc.find_all("span", attrs={'class': "instancename"}):
    # Detach the nested accesshide spans so their text is not included below
    innerspans = [x.extract() for x in span.find_all("span", attrs={'class': 'accesshide'})]
    print(span.text)
Output:
CLASS 8B PHY | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM
CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM
CLASS 8AB 2ND LG (NEPALI) | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM
CLASS 8B BIO | TUE | 12NOON to 12:40PM
CLASS 8AB CP APP | TUE | 5PM to 5:40PM
CLASS 8AB CM APP | TUE | 5PM to 5:40PM
Note that `[x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})]` extracts the `span` elements with the `accesshide` class and removes them from `span`. So, the actual text left is the `span` text without the text of the inner `span`s.
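If installing BeautifulSoup is not an option, the same idea (keep the text of each `instancename` span and skip the text of the nested `accesshide` span) can be sketched with the standard library's `html.parser.HTMLParser`. This is a different technique from the code above, not part of it; the class name `InstanceNameParser` and the shortened `sample` string are made up for illustration.

```python
from html.parser import HTMLParser


class InstanceNameParser(HTMLParser):
    """Collect the text of each span.instancename, skipping nested span text."""

    def __init__(self):
        super().__init__()
        self.names = []     # finished titles, one per instancename span
        self._stack = []    # class attribute of every currently open <span>
        self._parts = None  # text pieces of the instancename span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        css_class = dict(attrs).get("class") or ""
        self._stack.append(css_class)
        if css_class == "instancename" and self._parts is None:
            self._parts = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        closed = self._stack.pop()
        if closed == "instancename" and self._parts is not None:
            self.names.append("".join(self._parts).strip())
            self._parts = None

    def handle_data(self, data):
        # Keep text only when the innermost open span is the instancename span.
        if self._parts is not None and self._stack and self._stack[-1] == "instancename":
            self._parts.append(data)


# Shortened sample in the same shape as the question's HTML (assumption).
sample = (
    '<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="accesshide"> BigBlueButtonBN</span>, '
    '<span class="instancename">CLASS 8AB 2ND LG (BENGALI) | TUE | 10AM to 10:40AM '
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

parser = InstanceNameParser()
parser.feed(sample)
parser.close()
for name in parser.names:
    print(name)
```

The sketch only tracks `span` elements and assumes they are properly nested, which holds for the markup in the question; for anything messier, a real parser such as BeautifulSoup is the safer choice.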
Upvotes: 1
Reputation: 781716
Try:
regex = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')
Don't use `^`, because then it will only match at the beginning and won't find all the matches.
`CLASS` shouldn't be in square brackets. `[CLASS]` matches a single character that's either `C`, `L`, `A`, or `S`.
Use `.*` to match any text after `CLASS`, and make it non-greedy using `?`.
Don't match just `M` at the end, because then it will match the next `M` anywhere in the string. You should only match it after a time and `A` or `P`. You also need to match the start time and `to`, so it doesn't stop matching at the start time.
Upvotes: 0
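To see the points above in action, here is a minimal sketch that runs the question's pattern, the answer's pattern, and one assumed variant against a shortened sample. The sample string and the names `answer_regex` and `noon_regex` are illustrative; the variant is not from the answer. It is included because the data contains a `12NOON` start time, which `[AP]M` does not cover, so the lazy `.*?` keeps going until the next entry that does fit.

```python
import re

# Shortened sample in the same shape as the question's string (assumption).
html_text = (
    '<span class="instancename">CLASS 8B PHY | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

# Pattern from the question: anchored at the start, character class, bare M.
original_matches = re.findall(r'^[CLASS]*[M]', html_text)

# Pattern from the answer: the start time must end in AM or PM.
answer_regex = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')
answer_matches = answer_regex.findall(html_text)

# Assumed variant: never cross a '<', and accept NOON after the start time.
noon_regex = re.compile(r'CLASS[^<]*?[\d:]+(?:[AP]M|NOON) to [\d:]+[AP]M')
noon_matches = noon_regex.findall(html_text)

print(original_matches)     # nothing matches at position 0
print(len(answer_matches))  # the NOON entry is merged with the one after it
for title in noon_matches:
    print(title)
```

Replacing `.*?` with `[^<]*?` is a design choice rather than a requirement: it keeps a match from ever spanning a tag, so an entry with an unexpected time format is skipped instead of swallowing the next entry.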