Reputation: 77
I extract some code from a web page (http://www.opensolaris.org/os/community/on/flag-days/all/) like follows,
<tr class="build">
<th colspan="0">Build 110</th>
</tr>
<tr class="arccase project flagday">
<td>Feb-25</td>
<td></td>
<td></td>
<td></td>
<td>
<a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />
cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br />
CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br />
</td>
</tr>
and there are some relative url path in it, now I want to search it with regular expression and replace them with absolute url path. Since I know urljoin can do the replace work like that,
>>> urljoin("http://www.opensolaris.org/os/community/on/flag-days/all/",
... "/os/community/arc/caselog/2008/777/")
'http://www.opensolaris.org/os/community/arc/caselog/2008/777/'
Now I want to know that how to search them using regular expressions, and finally tanslate the code to,
<tr class="build">
<th colspan="0">Build 110</th>
</tr>
<tr class="arccase project flagday">
<td>Feb-25</td>
<td></td>
<td></td>
<td></td>
<td>
<a href="http://www.opensolaris.org/os/community/on/flag-days/all//pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />
cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br />
CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br />
</td>
</tr>
My knowledge of regular expressions is so poor that I want to know how to do that. Thanks
I have finished the work using Beautiful Soup, haha~ Thx for everybody!
Upvotes: 2
Views: 4956
Reputation: 29452
this isn't elegant, but does the job:
import re
from urlparse import urljoin
relative_urls_re = re.compile('(<\s*a[^>]+href\s*=\s*["\']?)(?!http)([^"\'>]+)', re.IGNORECASE)
relative_urls_re.sub(lambda m: m.group(1) + urljoin(base_url, m.group(2)), html)
Upvotes: 2
Reputation: 89171
First, I'd recommend using a HTML parser, such as BeautifulSoup. HTML is not a regular language, and thus can't be parsed fully by regular expressions alone. Parts of HTML can be parsed though.
If you don't want to use a full HTML parser, you could use something like this to approximate the work:
import re, urlparse
find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)')
def fix_urls(document, base_url):
ret = []
last_end = 0
for match in find_re.finditer(document):
url = match.group(1)
if url[0] in "\"'":
url = url.strip(url[0])
parsed = urlparse.urlparse(url)
if parsed.scheme == parsed.netloc == '': #relative to domain
url = urlparse.urljoin(base_url, url)
ret.append(document[last_end:match.start(1)])
ret.append('"%s"' % (url,))
last_end = match.end(1)
ret.append(document[last_end:])
return ''.join(ret)
Example:
>>> document = '''<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>'''
>>> fix_urls(document,"http://www.opensolaris.org/os/community/on/flag-days/all/")
'<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="http://www.opensolaris.org/os/community/on/flag-days/pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>'
>>>
Upvotes: 3
Reputation:
Don't use regular expressions to parse HTML. Use a real parser for that. For example BeautifulSoup.
Upvotes: 2
Reputation: 655299
Something like this should do it:
"(?:[^/:"]+|/(?!/))(?:/[^/"]+)*"
Upvotes: 1
Reputation: 179119
I'm not sure about what you're trying to achieve but using the BASE tag in HTML may do this trick for you without having to resort to regular expressions when doing the processing.
Upvotes: 5