Ayush Kumar

Reputation: 532

Unable to match specific string with regex

I am trying to match HTML text that has been converted to a string, but none of my regexes work.

The HTML text I am trying to match against:

"[<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]"

The sentences that I want to match are:

  1. CLASS 8B PHY | TUE | 9AM to 9:40AM

  2. CLASS 8AB 2ND LG (HINDI) | TUE | 10AM to 10:40AM

  3. CLASS 8B GEOG | TUE | 11AM to 11:40AM

and so on for the remaining entries in the HTML text above.

The code that I am using to match these doesn't seem to work:

import re
html_text = [<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM <span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B GEOG | TUE | 11AM to 11:40AM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>, <span class="instancename">CLASS 8AB CM APP | TUE | 5PM to 5:40PM<span class="accesshide"> BigBlueButtonBN</span></span>, <span class="accesshide"> BigBlueButtonBN</span>]

regex = re.compile(r'^[CLASS]*[M]')
match = regex.findall(str(html_text))
print(match)

I think I am not providing the right regex for this match.

Upvotes: 0

Views: 59

Answers (2)

Wiktor Stribiżew

Reputation: 627087

You are dealing with HTML, so it makes sense to use BeautifulSoup to parse HTML in Python.

from bs4 import BeautifulSoup
s = """Your HTML goes here"""  # 's' is the HTML string used to build the document 'doc'
doc = BeautifulSoup(s, 'html.parser')
for span in doc.find_all("span", attrs={'class':"instancename"}):
    innerspans = [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})]
    print(span.text)

Output:

CLASS 8B PHY  | TUE | 9AM to 9:40AM
CLASS 8AB 2ND LG (HINDI)  | TUE | 10AM to 10:40AM
CLASS 8AB 2ND LG (BENGALI)  | TUE | 10AM to 10:40AM 
CLASS 8AB 2ND LG (NEPALI)  | TUE | 10AM to 10:40AM
CLASS 8B GEOG | TUE | 11AM to 11:40AM
CLASS 8B BIO | TUE | 12NOON to 12:40PM
CLASS 8AB CP APP | TUE | 5PM to 5:40PM
CLASS 8AB CM APP | TUE | 5PM to 5:40PM

Note that [x.extract() for x in span.find_all("span", attrs={'class':'accesshide'})] extracts the inner span elements with the accesshide class and removes them from span. So span.text then returns only the outer span's own text, without the text of the inner spans.
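If installing BeautifulSoup is not an option, the same idea (keep the text of each instancename span, ignore the nested accesshide spans) can be sketched with the standard library's html.parser. This is not the code from the answer above, only a rough alternative under the assumption that the markup is a flat sequence of span elements as in the question. The class name InstanceNameParser and the shortened sample string are made up for illustration.

```python
from html.parser import HTMLParser


class InstanceNameParser(HTMLParser):
    """Collect text directly inside <span class="instancename">, skipping nested spans."""

    def __init__(self):
        super().__init__()
        self.stack = []    # class attribute of every <span> that is currently open
        self.current = []  # text pieces of the instancename span being read
        self.names = []    # finished results, one string per instancename span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.stack.append(dict(attrs).get("class", ""))

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            closed = self.stack.pop()
            if closed == "instancename":
                self.names.append("".join(self.current).strip())
                self.current = []

    def handle_data(self, data):
        # Keep text only when the innermost open span is an instancename span.
        if self.stack and self.stack[-1] == "instancename":
            self.current.append(data)


# Shortened sample in the same shape as the HTML from the question (assumption).
sample = (
    '<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="accesshide"> BigBlueButtonBN</span>, '
    '<span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

parser = InstanceNameParser()
parser.feed(sample)
parser.close()
names = parser.names
print(names)
```

Unlike span.text after extract(), this version also strips leading and trailing whitespace from each result.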

Upvotes: 1

Barmar

Reputation: 781716

Try:

regex = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')
  1. You shouldn't start the pattern with ^, because then it will only match at the beginning and won't find all the matches.
  2. CLASS shouldn't be in square brackets. [CLASS] matches a single character that's either C, L, A, or S.
  3. You need .* to match any text after CLASS, and you should make it non-greedy with ? so it stops at the first time range instead of the last one.
  4. You can't just match M at the end, because then it will match the next M anywhere in the string. You should only match it after a time and A or P. And you also need to match the start time and to, so it doesn't stop matching at the start time.
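To see how the pattern above behaves, here is a small check against a shortened sample in the same shape as the question's HTML. One thing it shows: the "12NOON to 12:40PM" entry has no A or P before its start-time M, so the non-greedy .*? runs on into the next entry and the two come back as a single match. The second pattern, variant_re, is my own guess at a fix and not part of the answer. It stops at the next < and also accepts NOON as a start time, which assumes the class names never contain <.

```python
import re

# Shortened sample in the same shape as the HTML from the question (assumption).
sample = (
    '<span class="instancename">CLASS 8B PHY  | TUE | 9AM to 9:40AM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="instancename">CLASS 8B BIO | TUE | 12NOON to 12:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>, '
    '<span class="instancename">CLASS 8AB CP APP | TUE | 5PM to 5:40PM'
    '<span class="accesshide"> BigBlueButtonBN</span></span>'
)

# Pattern from the answer above.
answer_re = re.compile(r'CLASS.*?[\d:]+[AP]M to [\d:]+[AP]M')

# Hypothetical variant: [^<] keeps a match inside one span's text,
# and (?:[AP]M|NOON) also accepts "12NOON" as a start time.
variant_re = re.compile(r'CLASS[^<]*?[\d:]+(?:[AP]M|NOON) to [\d:]+[AP]M')

answer_matches = answer_re.findall(sample)
variant_matches = variant_re.findall(sample)

print(len(answer_matches))   # 2: the BIO and CP APP entries are merged into one match
for m in variant_matches:    # three separate entries
    print(m)
```

If the times can take other forms (for example 12MIDNIGHT), the alternation would need extending, so parsing the HTML first and cleaning the text afterwards is probably the more robust route.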

Upvotes: 0
