Adding string to scraped url (scrapy)

Question

I have made a scraper that goes through the threads in a forum and saves all links posted by users. The problem is that the forum routes outgoing links through a "do you really want to leave the site" page, so the links I scrape are incomplete, like so:

/leave.php?u=http%3A%2F%2Fwww.lonestatistik.se%2Floner.asp%2Fyrke%2FUnderskoterska-1242

To work, the link would need the website's domain at the beginning.

Is there a way to add it? Or a way to scrape just the target URL?

def parse(self, response):
    next_link = response.xpath("//a[contains(., '>')]//@href").extract()[0]
    if len(next_link):
        yield self.make_requests_from_url(urljoin(response.url, next_link))

    posts = Selector(response).xpath('//div[@class="post_message"]')
    for post in posts:
        i = TextPostItem()
        i['url'] = post.xpath('a/@href').extract()

        yield i

-edit- Based on eLRuLL's answer, I did this:

def parse(self, response):
    next_link = response.xpath("//a[contains(., '>')]//@href").extract()[0]
    if len(next_link):
        yield self.make_requests_from_url(urljoin(response.url, next_link))
    posts = Selector(response).xpath('//div[@class="post_message"]')
    for post in posts:
        i = TextPostItem()
        url = post.xpath('./a/@href').extract_first()
        i['new_url'] = urljoin(response.url, url)

        yield i

This worked, except that I now scrape a URL for every single post, even if that post didn't contain a link.
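
A likely cause, sketched outside of Scrapy so it can be run on its own: extract_first() returns None when the XPath matches nothing, and urljoin() with an empty or None second argument simply returns the base URL, so every post without a link ends up with the page's own address. The helper name absolute_links and the example.com values below are hypothetical; the idea is only to skip empty hrefs before joining.

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Yield absolute URLs, skipping posts where no href was found."""
    for href in hrefs:
        # None or "" means the post had no link; urljoin would
        # otherwise hand back page_url unchanged for these.
        if not href:
            continue
        yield urljoin(page_url, href)


# Hypothetical stand-ins for response.url and the per-post results of
# post.xpath('./a/@href').extract_first().
page_url = "http://forum.example.com/thread/1"
hrefs = ["/leave.php?u=http%3A%2F%2Fexample.org%2F", None, ""]
links = list(absolute_links(page_url, hrefs))
print(links)
```

Inside parse(), the same check would amount to testing the extracted href and only building and yielding the item when it is not empty.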

eLRuLL · Accepted Answer

It looks like you need to add the site's domain to the beginning of that URL. You could use response.url as the base and join the scraped link onto it, so something like:

from urlparse import urljoin  # Python 3: from urllib.parse import urljoin
from scrapy import Request
...
url = post.xpath('./a/@href').extract_first()
new_url = urljoin(response.url, url)  # someurl.com/leave.php?...
yield Request(new_url, ...)
...
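
For the other option in the question, scraping only the target URL, here is a minimal sketch using the standard library. It assumes the real destination is always carried in the u query parameter, as in the sample link from the question; target_url is a hypothetical helper name, not part of Scrapy.

```python
from urllib.parse import urlparse, parse_qs


def target_url(href, param="u"):
    """Return the decoded redirect target from a leave.php-style link, or None."""
    # parse_qs percent-decodes the values, so %3A%2F%2F becomes ://
    query = parse_qs(urlparse(href).query)
    values = query.get(param)
    return values[0] if values else None


href = "/leave.php?u=http%3A%2F%2Fwww.lonestatistik.se%2Floner.asp%2Fyrke%2FUnderskoterska-1242"
print(target_url(href))
# http://www.lonestatistik.se/loner.asp/yrke/Underskoterska-1242
```

Links that do not go through the leave page have no u parameter and give None here, so they would need to be kept as they are or joined with response.url instead.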
