lucas

Reputation: 77

How to find a relative URL and translate it to an absolute URL in Python

I extracted some HTML from a web page (http://www.opensolaris.org/os/community/on/flag-days/all/), as follows:

<tr class="build">
  <th colspan="0">Build 110</th>
</tr>
<tr class="arccase project flagday">
  <td>Feb-25</td>
  <td></td>
  <td></td>
  <td></td>
  <td>
    <a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />
    cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br />
    CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br />
  </td>
</tr>

It contains some relative URL paths. I want to find them with a regular expression and replace them with absolute URLs. I know that urljoin can do the conversion, like this:

>>> from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
>>> urljoin("http://www.opensolaris.org/os/community/on/flag-days/all/",
...         "/os/community/arc/caselog/2008/777/")
'http://www.opensolaris.org/os/community/arc/caselog/2008/777/'
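For reference, here is a minimal sketch of how urljoin treats the kinds of href that appear in the snippet above. The variable names are only illustrative, and the http://example.com/other/ link is an assumed example that is not in the original page.

```python
from urllib.parse import urljoin

base_url = "http://www.opensolaris.org/os/community/on/flag-days/all/"

# "../" climbs one directory up from the base URL's directory (".../all/").
parent_relative = urljoin(base_url, "../pages/2009022501/")

# A leading "/" replaces the whole path but keeps the scheme and host.
root_relative = urljoin(base_url, "/os/community/arc/caselog/2008/777/")

# A URL that already has a scheme and host is returned unchanged.
already_absolute = urljoin(base_url, "http://example.com/other/")

print(parent_relative)
print(root_relative)
print(already_absolute)
```

Note that the "../" link resolves to .../flag-days/pages/2009022501/, one level above "all/".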

Now I want to know how to find them using regular expressions, and finally translate the code to:

<tr class="build">
  <th colspan="0">Build 110</th>
</tr>
<tr class="arccase project flagday">
  <td>Feb-25</td>
  <td></td>
  <td></td>
  <td></td>
  <td>
    <a href="http://www.opensolaris.org/os/community/on/flag-days/pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />
    cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br />
    CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br />
  </td>
</tr>

My knowledge of regular expressions is poor, so I would like to know how to do that. Thanks.

I have finished the work using Beautiful Soup, haha. Thanks, everybody!

Upvotes: 2

Views: 4956

Answers (5)

hoju

Reputation: 29452

This isn't elegant, but it does the job:

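import re
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# base_url is the URL of the page; html is the markup to rewrite.
# Group 1 is the tag text up to the href value; group 2 is a value not starting with "http".
relative_urls_re = re.compile(r'(<\s*a[^>]+href\s*=\s*["\']?)(?!http)([^"\'>]+)', re.IGNORECASE)
html = relative_urls_re.sub(lambda m: m.group(1) + urljoin(base_url, m.group(2)), html)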
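To try the substitution above on its own, a small self-contained sketch might look like the following. The values of base_url and html are assumptions made up for the demonstration (shortened from the question's snippet, plus one already-absolute link to http://example.com/ that is not in the original).

```python
import re
from urllib.parse import urljoin

relative_urls_re = re.compile(
    r'(<\s*a[^>]+href\s*=\s*["\']?)(?!http)([^"\'>]+)', re.IGNORECASE)

# Hypothetical inputs for the demonstration.
base_url = "http://www.opensolaris.org/os/community/on/flag-days/all/"
html = ('<a href="../pages/2009022501/">Flag Day</a> '
        '<a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a> '
        '<a href="http://example.com/">already absolute</a>')

# Keep group 1 (the tag text up to the URL) and swap in the resolved URL.
result = relative_urls_re.sub(
    lambda m: m.group(1) + urljoin(base_url, m.group(2)), html)
print(result)
```

The negative lookahead only skips values that start with "http", and the pattern only looks at tags that begin with "<a", so this is best treated as a quick fix for markup you control.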

Upvotes: 2

Markus Jarderot

Reputation: 89171

First, I'd recommend using an HTML parser, such as BeautifulSoup. HTML is not a regular language, and thus can't be parsed fully by regular expressions alone. Parts of HTML can be parsed, though.

If you don't want to use a full HTML parser, you could use something like this to approximate the work:

import re
import urllib.parse as urlparse  # Python 2: import urlparse

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)')

def fix_urls(document, base_url):
    ret = []
    last_end = 0
    for match in find_re.finditer(document):
        url = match.group(1)
        if url[0] in "\"'":
            url = url.strip(url[0])
        parsed = urlparse.urlparse(url)
        if parsed.scheme == parsed.netloc == '':  # relative URL: no scheme and no host
            url = urlparse.urljoin(base_url, url)
            ret.append(document[last_end:match.start(1)])
            ret.append('"%s"' % (url,))
            last_end = match.end(1)
    ret.append(document[last_end:])
    return ''.join(ret)

Example:

>>> document = '''<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="../pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>'''
>>> fix_urls(document,"http://www.opensolaris.org/os/community/on/flag-days/all/")
'<tr class="build"><th colspan="0">Build 110</th></tr> <tr class="arccase project flagday"><td>Feb-25</td><td></td><td></td><td></td><td><a href="http://www.opensolaris.org/os/community/on/flag-days/pages/2009022501/">Flag Day and Heads Up: Power Aware Dispatcher and Deep C-States</a><br />cpupm keyword mode extensions - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/777/">PSARC/2008/777</a><br /> CPU Deep Idle Keyword - <a href="http://www.opensolaris.org/os/community/arc/caselog/2008/663/">PSARC/2008/663</a><br /></td></tr>'
>>>

Upvotes: 3

unbeknown

Reputation:

Don't use regular expressions to parse HTML. Use a real parser for that, for example BeautifulSoup.
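As a rough illustration of the parser-based route, here is a sketch that uses only the standard library's html.parser instead of BeautifulSoup (a swapped-in technique, not what the asker ended up using). The names HrefAbsolutizer and absolutize are hypothetical. It re-emits the document from parser events, so the output is normalized (lowercase tag names, double-quoted attributes) rather than byte-for-byte identical to the input.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefAbsolutizer(HTMLParser):
    """Re-emit the parsed document, resolving every href against base_url."""

    def __init__(self, base_url):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _format_tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value
                parts.append(name)
                continue
            if name == "href":
                value = urljoin(self.base_url, value)
            # The parser unescapes attribute values, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._format_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._format_tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def absolutize(document, base_url):
    parser = HrefAbsolutizer(base_url)
    parser.feed(document)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


fixed = absolutize(
    '<td><a href="../pages/2009022501/">Flag Day</a><br /></td>',
    "http://www.opensolaris.org/os/community/on/flag-days/all/")
print(fixed)
```

Unlike a regex, this resolves href on any tag and does not care how the attribute was quoted in the source.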

Upvotes: 2

Gumbo

Reputation: 655299

Something like this should do it:

"(?:[^/:"]+|/(?!/))(?:/[^/"]+)*"

Upvotes: 1

Ionuț G. Stan

Reputation: 179119

I'm not sure what you're trying to achieve, but using the BASE tag in HTML may do the trick for you, without having to resort to regular expressions when doing the processing.
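A minimal sketch of that idea, assuming the extracted fragment is going to be served inside a page you generate yourself. The wrap_with_base helper and the surrounding page skeleton are hypothetical.

```python
from html import escape


def wrap_with_base(fragment, base_url):
    """Wrap an HTML fragment in a page whose <base> tag points at base_url."""
    return (
        "<html><head>"
        '<base href="%s">'
        "</head><body><table>%s</table></body></html>"
        % (escape(base_url, quote=True), fragment)
    )


fragment = '<tr><td><a href="../pages/2009022501/">Flag Day</a></td></tr>'
page = wrap_with_base(
    fragment, "http://www.opensolaris.org/os/community/on/flag-days/all/")
print(page)
```

The links in the fragment stay relative in the markup; a browser resolves them against the base URL when the page is rendered. This does not help if you need the absolute URLs in the text itself, for example for further processing in Python.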

Upvotes: 5
