Reputation: 1
I need some help from you Pythonists: I'm scraping all urls starting with "details.php?" from this page and ignoring all other urls.
Then I need to convert every url I just scraped to an absolute url, so I can scrape them one by one. The absolute urls start with: http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?...
I tried using re.findall like this:
html = scraperwiki.scrape(url)
if html is not None:
    endofurl = re.findall("details.php?(.*?)>", html)
This gets me a list, but then I get stuck. Can anybody help me out?
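(As an aside on the pattern itself: "." and "?" are regex metacharacters, so they should be escaped to match literally. A minimal sketch, using a made-up HTML snippet for illustration:)

```python
import re

# Sample HTML standing in for the scraped page (hypothetical data)
html = '<a href="details.php?run=1">1</a> <a href="other.php">x</a>'

# Escape the literal "." and "?" so they are not treated as regex operators;
# stop matching at a quote, ">" or whitespace
endofurl = re.findall(r'details\.php\?[^"\'>\s]*', html)
# endofurl is now ['details.php?run=1']
```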
Upvotes: 0
Views: 185
Reputation: 28256
You can use urlparse.urljoin() to create the full urls:
>>> import urlparse
>>> base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
>>> urlparse.urljoin(base_url, 'details.php?whatever')
'http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?whatever'
You can use a list comprehension to do this for all of your urls:
full_urls = [urlparse.urljoin(base_url, url) for url in endofurl]
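Putting the two steps together, a runnable sketch (sample HTML assumed; note that on Python 3 the function lives in urllib.parse instead of the urlparse module):

```python
import re
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Sample HTML standing in for the scraped page (hypothetical data)
html = '<a href="details.php?d=1">1</a> <a href="details.php?d=2">2</a>'
base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'

# 1. Scrape the relative urls
endofurl = re.findall(r'details\.php\?[^"\'>\s]*', html)

# 2. Resolve each one against the base url
full_urls = [urljoin(base_url, url) for url in endofurl]
```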
Upvotes: 3
Reputation: 414555
If you'd like to use lxml.html to parse the html, there is .make_links_absolute():
import lxml.html
html = lxml.html.make_links_absolute(html,
base_href="http://evenementen.uitslagen.nl/2013/marathonrotterdam/")
Upvotes: 0
Reputation: 4521
Ah! My favorite...list comprehensions!
base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/{0}'
urls = [base_url.format(x) for x in list_of_things_you_scraped]
I'm not a regex genius, so you may need to fiddle with base_url until you get it exactly right.
Upvotes: 0
Reputation: 673
If you need the final urls one by one and are done with them after a single pass, you should use a generator expression instead of a list:
abs_url = "url data"
urls = (abs_url+url for url in endofurl)
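Expanded into a runnable sketch (sample inputs assumed; urljoin is used here for the concatenation, and on Python 3 it lives in urllib.parse):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
# Sample scraped list (hypothetical data)
endofurl = ['details.php?d=1', 'details.php?d=2']

# A generator expression builds each absolute url lazily, on demand,
# rather than materialising the whole list up front
urls = (urljoin(base_url, url) for url in endofurl)

first = next(urls)  # urls are produced one at a time
```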
If you are worried about encoding, note that urllib.urlencode() builds a query string from key/value pairs; to escape a single url string, use urllib.quote() instead.
Upvotes: 0