Unable to match specific string with regex

Question

I am trying to match text that was extracted from HTML and converted to a string, but none of my regexes work.

The HTML text I am trying to match against:

"[CLASS 8B PHY  | TUE | 9AM to 9:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM  BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8B GEOG | TUE | 11AM to 11:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8B BIO | TUE | 12NOON to 12:40PM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB CP APP | TUE | 5PM to 5:40PM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB CM APP | TUE | 5PM to 5:40PM BigBlueButtonBN,  BigBlueButtonBN]"

The sentences that I want to match are:

CLASS 8B PHY | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM

and the other entries of the same form in the text above.

The code I am using to match these doesn't seem to work:

import re
html_text = "[CLASS 8B PHY  | TUE | 9AM to 9:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM  BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8B GEOG | TUE | 11AM to 11:40AM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8B BIO | TUE | 12NOON to 12:40PM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB CP APP | TUE | 5PM to 5:40PM BigBlueButtonBN,  BigBlueButtonBN, CLASS 8AB CM APP | TUE | 5PM to 5:40PM BigBlueButtonBN,  BigBlueButtonBN]"

regex = re.compile(r'^[CLASS]*[M]')
match = regex.findall(str(html_text))
print(match)

I think I am not providing the right regex to match these entries.

Wiktor Stribiżew · Accepted Answer

You are dealing with HTML, so it makes sense to use BeautifulSoup to parse HTML in Python.

from bs4 import BeautifulSoup
s = """Your HTML goes here""" # 's' is the string holding the HTML; `doc` is the parsed document built from it
doc = BeautifulSoup(s, 'html.parser')
for span in doc.find_all("span", attrs={'class':"instancename"}):
    innerspans = [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})]
    print(span.text)

Output:

CLASS 8B PHY  | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM
CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM 
CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM
CLASS 8B BIO | TUE | 12NOON to 12:40PM
CLASS 8AB CP APP | TUE | 5PM to 5:40PM
CLASS 8AB CM APP | TUE | 5PM to 5:40PM

Note that [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})] removes the inner span elements with the accesshide class from span: extract() detaches each element from the tree and returns it. So the text that remains in span.text is the span's own text, without the text of the inner spans.
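
If only the flattened string from the question is available, and not the original HTML, a regex can serve as a fallback. The pattern in the question returns an empty list because [CLASS] is a character class that matches a single C, L, A or S rather than the word CLASS, and ^ only permits a match at the very start of the string, which here is [. The following is a minimal sketch that is not part of the original answer. It assumes every entry starts with CLASS, contains two | separators, and ends with a time such as 9:40AM or 12:40PM. The names ENTRY and extract_entries are made up for illustration, and the sample is a shortened copy of the question's text.

```python
import re

# Shortened sample of the flattened text from the question.
html_text = (
    "[CLASS 8B PHY  | TUE | 9AM to 9:40AM BigBlueButtonBN,  BigBlueButtonBN, "
    "CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM BigBlueButtonBN,  BigBlueButtonBN, "
    "CLASS 8B BIO | TUE | 12NOON to 12:40PM BigBlueButtonBN,  BigBlueButtonBN]"
)

# Hypothetical pattern; it relies on the layout assumptions stated above.
ENTRY = re.compile(
    r"CLASS[^|]*"                # title: the word CLASS up to the first pipe
    r"\|[^|]*\|\s*"              # day column enclosed by two pipes
    r"\S+\s+to\s+"               # start time (9AM, 12NOON, ...) and the word 'to'
    r"\d{1,2}(?::\d{2})?[AP]M"   # end time such as 9:40AM or 5PM
)


def extract_entries(text):
    """Return every schedule entry found in text, with runs of spaces collapsed."""
    return [" ".join(match.split()) for match in ENTRY.findall(text)]


entries = extract_entries(html_text)
for entry in entries:
    print(entry)
```

When the markup is available, parsing the HTML as in the answer above is likely the more robust choice, since this pattern depends on the exact layout of the text.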
