Joe Julen

Reputation: 71

RegEx for matching specific element of HTML

I am working on Python code that extracts specific elements from websites and then prints them in a GUI implemented with the tkinter module. Extracting specific elements from a webpage requires regex, which I am new to. Although I can obtain various elements, I still find it difficult to extract certain ones. One such example is presented below.

<div class="updated published time-details"><a class="url" 
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/" 
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;" 
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00 
    pm</span>
    </a>
</div>

This is part of the HTML from which I need only the title, i.e. "Chocolate Starfish (AUS) & One Last Kick". I am using the findall method, and we are not allowed to use an external library such as Beautiful Soup, so we have to work with findall, finditer, MULTILINE and DOTALL.

How do I get the desired outcome?

Upvotes: 1

Views: 285

Answers (2)

user557597

Reputation:

This regex finds 'a' tags that have a 'title' attribute; the attribute's value is captured in Group 2.

As a string literal

r"(?si)<a(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\stitle\s*=\s*(['\"])(.*?)\1)(?:\".*?\"|'.*?'|[^>]*?)+>"

Readable version

 (?si)

 <a
 (?=
      (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
      \s title \s* = \s* 
      ( ['"] )                      # (1)
      ( .*? )                       # (2)
      \1 
 )
 (?: " .*? " | ' .*? ' | [^>]*? )+
 >
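
For what it's worth, here is a minimal sketch of how the pattern above might be used from Python with finditer, reading Group 2 for each match. The sample input is just the snippet from the question, and the names TITLE_RE, page and titles are my own placeholders, not anything from the original code.

```python
import re

# Pattern copied from the string form above; group 1 is the quote
# character and group 2 is the value of the title attribute.
TITLE_RE = re.compile(
    r"(?si)<a(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\stitle\s*=\s*(['\"])(.*?)\1)(?:\".*?\"|'.*?'|[^>]*?)+>"
)

# The HTML fragment from the question, used here as sample input.
page = '''<div class="updated published time-details"><a class="url"
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;"
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
    pm</span>
    </a>
</div>'''

# finditer yields match objects, so group 2 can be picked out directly.
titles = [m.group(2) for m in TITLE_RE.finditer(page)]
print(titles)
```

Note that findall would return (quote, title) tuples here because the pattern has two capture groups, which is why finditer with group(2) is a little more convenient.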

Benchmark using a large web page (cnn.com) and 300 iterations

Regex1:   (?si)<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\stitle\s*=\s*(['"])(.*?)\1)(?:".*?"|'.*?'|[^>]*?)+>
Options:  < none >
Completed iterations:   300  /  300     ( x 1 )
Matches found per iteration:   285
Elapsed Time:    3.26 s,   3262.08 ms,   3262081 µs
Matches per sec:   26,210
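
If you want to run a similar timing yourself in Python, something along these lines should work with the standard timeit module. This is only a rough sketch: the test page is a made-up anchor repeated 20 times (SAMPLE_ANCHOR and count_matches are hypothetical names), not the cnn.com page used above, so the numbers will not be comparable.

```python
import re
import timeit

# Same pattern as above, split across two raw strings for readability.
TITLE_RE = re.compile(
    r"(?si)<a(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\stitle\s*=\s*(['\"])(.*?)\1)"
    r"(?:\".*?\"|'.*?'|[^>]*?)+>"
)

# Hypothetical test input: one small anchor tag repeated 20 times.
SAMPLE_ANCHOR = (
    '<a class="url" href="https://example.invalid/gig/" '
    'title="Some Gig" rel="bookmark">text</a>\n'
)
page = SAMPLE_ANCHOR * 20


def count_matches(text):
    """Count how many anchors with a title attribute the pattern finds."""
    return sum(1 for _ in TITLE_RE.finditer(text))


matches = count_matches(page)
# Time 300 full scans of the page, mirroring the iteration count above.
elapsed = timeit.timeit(lambda: count_matches(page), number=300)
print(f"{matches} matches per iteration, {elapsed:.3f} s for 300 iterations")
```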

Upvotes: 1

jspcal

Reputation: 51904

Using an HTML-aware solution like BeautifulSoup would handle more cases, but if you're sure the HTML will always conform to your example, you can use a rough regex match like:

import re

re.findall(r'<a.*? title="(.*?)"', html, re.DOTALL)
# ['CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;']
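
The captured title still contains the HTML entities &#8220; and &#8221;. If the standard-library html module counts as allowed (an assumption on my part, since only external libraries were ruled out), html.unescape can turn them into the actual quote characters. A small sketch, with page, raw_titles and titles as placeholder names:

```python
import html
import re

# The HTML fragment from the question, used here as sample input.
page = '''<div class="updated published time-details"><a class="url"
    href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
    title="CHOCOLATE STARFISH (AUS) &#8220;ONE LAST KICK&#8221;"
    rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
    pm</span>
    </a>
</div>'''

# Rough match: the first title="..." that follows an opening <a.
raw_titles = re.findall(r'<a.*? title="(.*?)"', page, re.DOTALL)

# Decode numeric character references such as &#8220; into real characters.
titles = [html.unescape(t) for t in raw_titles]
print(titles)
```

The variable is called page rather than html here so it does not shadow the html module.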

Upvotes: 2
