Reputation: 1
I need some help from you Pythonists: I'm scraping all urls starting with "details.php?" from this page and ignoring all other urls.
Then I need to convert every url I just scraped to an absolute url, so I can scrape them one by one. The absolute urls start with: http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?...
I tried using re.findall like this:
html = scraperwiki.scrape(url)
if html is not None:
    endofurl = re.findall("details.php?(.*?)>", html)
This gets me a list, but then I get stuck. Can anybody help me out?
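(As an aside on the pattern itself: "." and "?" are regex metacharacters, so they should be escaped to match literally. A minimal sketch, using a made-up HTML snippet for illustration:)

```python
import re

# Sample HTML standing in for the scraped page (hypothetical data)
html = '<a href="details.php?run=1">1</a> <a href="other.php">x</a>'

# Escape the literal "." and "?" so they are not treated as regex operators;
# stop matching at a quote, ">" or whitespace
endofurl = re.findall(r'details\.php\?[^"\'>\s]*', html)
# endofurl is now ['details.php?run=1']
```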
Upvotes: 0
Views: 185
Reputation: 28256
You can use urlparse.urljoin() to create the full urls:
>>> import urlparse
>>> base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
>>> urlparse.urljoin(base_url, 'details.php?whatever')
'http://evenementen.uitslagen.nl/2013/marathonrotterdam/details.php?whatever'
You can use a list comprehension to do this for all of your urls:
full_urls = [urlparse.urljoin(base_url, url) for url in endofurl]
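Putting the two steps together, a runnable sketch (sample HTML assumed; note that on Python 3 the function lives in urllib.parse instead of the urlparse module):

```python
import re
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# Sample HTML standing in for the scraped page (hypothetical data)
html = '<a href="details.php?d=1">1</a> <a href="details.php?d=2">2</a>'
base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'

# 1. Scrape the relative urls
endofurl = re.findall(r'details\.php\?[^"\'>\s]*', html)

# 2. Resolve each one against the base url
full_urls = [urljoin(base_url, url) for url in endofurl]
```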
Upvotes: 3
Reputation: 414555
If you'd like to use lxml.html to parse the html, there is .make_links_absolute():
import lxml.html
html = lxml.html.make_links_absolute(html,
base_href="http://evenementen.uitslagen.nl/2013/marathonrotterdam/")
Upvotes: 0
Reputation: 4521
Ah! My favorite...list comprehensions!
base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/{0}'
urls = [base_url.format(x) for x in list_of_things_you_scraped]
I'm not a regex genius, so you may need to fiddle with base_url until you get it exactly right.
Upvotes: 0
Reputation: 673
If you need the final urls one by one and are done with them after a single pass, you should use a generator expression instead of a list:
abs_url = "url data"
urls = (abs_url+url for url in endofurl)
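Expanded into a runnable sketch (sample inputs assumed; urljoin is used here for the concatenation, and on Python 3 it lives in urllib.parse):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base_url = 'http://evenementen.uitslagen.nl/2013/marathonrotterdam/'
# Sample scraped list (hypothetical data)
endofurl = ['details.php?d=1', 'details.php?d=2']

# A generator expression builds each absolute url lazily, on demand,
# rather than materialising the whole list up front
urls = (urljoin(base_url, url) for url in endofurl)

first = next(urls)  # urls are produced one at a time
```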
If you are worried about encoding, note that urllib.urlencode() builds a query string from key/value pairs; to escape a single url string, use urllib.quote() instead.
Upvotes: 0