Reputation: 71
I am working on a Python program that extracts specific elements from websites and then prints them in a GUI built with the tkinter module. Extracting specific elements from a webpage requires regular expressions, which I am new to. Although I can obtain various elements, I still find it difficult to extract certain ones. One such example is presented below.
<div class="updated published time-details"><a class="url"
href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
title="CHOCOLATE STARFISH (AUS) “ONE LAST KICK”"
rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
pm</span>
</a>
</div>
This is part of the HTML from which I need only the title, i.e. "Chocolate Starfish (AUS) & One Last Kick". I am using the findall method, and we are not allowed to use an external library such as Beautiful Soup, so we have to work with findall, finditer, MULTILINE and DOTALL.
How do I get the desired outcome?
Upvotes: 1
Views: 285
Reputation:
This regex finds 'a' tags that have a 'title' attribute; the attribute's value is captured in group 2.
As a string literal
r"(?si)<a(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?\stitle\s*=\s*(['\"])(.*?)\1)(?:\".*?\"|'.*?'|[^>]*?)+>"
Readable version
(?si)
<a
(?=
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
\s title \s* = \s*
( ['"] ) # (1)
( .*? ) # (2)
\1
)
(?: " .*? " | ' .*? ' | [^>]*? )+
>
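To show how group 2 would be read from Python, here is a minimal sketch using re.VERBOSE and finditer. It is a simplified variant written for illustration, not the exact pattern above: it assumes no attribute value before title contains a literal '>', and the names TITLE_RE, anchor_titles and sample are hypothetical.

```python
import re

# Simplified, hypothetical variant of the pattern above. Assumption: no
# attribute value that precedes title contains a literal '>'.
TITLE_RE = re.compile(
    r"""
    <a\b               # start of an anchor tag (not <abbr>, <area>, ...)
    [^>]*?             # earlier attributes; a negated class also spans newlines
    \s title \s*=\s*   # the title attribute name and the equals sign
    (["'])             # group 1: opening quote
    (.*?)              # group 2: the title text
    \1                 # the same quote character closes the value
    """,
    re.VERBOSE | re.DOTALL | re.IGNORECASE,
)


def anchor_titles(html):
    """Return the title attribute of every <a> tag that has one."""
    return [m.group(2) for m in TITLE_RE.finditer(html)]


sample = '''<a class="url"
href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
title="CHOCOLATE STARFISH (AUS) “ONE LAST KICK”"
rel="bookmark">'''

print(anchor_titles(sample))
```

finditer is used here instead of findall because a pattern with two groups makes findall return (quote, title) tuples, whereas a match object lets you pick group 2 directly.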
Benchmark using a large web page (cnn.com) and 300 iterations
Regex1: (?si)<a(?=(?:[^>"']|"[^"]*"|'[^']*')*?\stitle\s*=\s*(['"])(.*?)\1)(?:".*?"|'.*?'|[^>]*?)+>
Options: < none >
Completed iterations: 300 / 300 ( x 1 )
Matches found per iteration: 285
Elapsed Time: 3.26 s, 3262.08 ms, 3262081 µs
Matches per sec: 26,210
Upvotes: 1
Reputation: 51904
Using an HTML-aware solution like BeautifulSoup would handle more cases, but if you're sure the HTML will always conform to your example, you can use a rough regex match like:
re.findall(r'<a.*?\stitle="(.*?)"', html, re.DOTALL)
# ['CHOCOLATE STARFISH (AUS) “ONE LAST KICK”']
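The role of re.DOTALL can be checked with a small self-contained sketch on the HTML from the question. It assumes the line breaks in the snippet are real, so it uses \s rather than a literal space before title; the variable names are illustrative only.

```python
import re

html = '''<div class="updated published time-details"><a class="url"
href="https://thetriffid.com.au/gig/chocolate-starfish-one-last-kick/"
title="CHOCOLATE STARFISH (AUS) “ONE LAST KICK”"
rel="bookmark"><span class="tribe-event-date-start">Sat Aug 3 @ 8:00
pm</span>
</a>
</div>'''

# \s (rather than a literal space) lets title follow a newline.
pattern = r'<a.*?\stitle="(.*?)"'

# Without DOTALL, '.' stops at the end of the first line, so nothing matches.
without_dotall = re.findall(pattern, html)

# With DOTALL, '.' also matches newlines and the lazy '.*?' can reach title.
with_dotall = re.findall(pattern, html, re.DOTALL)

print(without_dotall)
print(with_dotall)
```

With a single capture group, findall returns a plain list of the captured strings, so the title can be passed straight to the GUI code.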
Upvotes: 2