Jomasdf

Reputation: 268

Adding string to scraped url (scrapy)

I have made a scraper that goes through the threads in a forum and saves all links posted by users. The problem is that the forum routes outgoing links through a "do you really want to leave the site" page, so the links I scrape are incomplete, like this:

/leave.php?u=http%3A%2F%2Fwww.lonestatistik.se%2Floner.asp%2Fyrke%2FUnderskoterska-1242

To work, the link would need the website's domain at the beginning.

Is there a way to add it? Or to scrape just the target URL instead?

def parse(self, response):
    next_link = response.xpath("//a[contains(., '>')]//@href").extract()[0]
    if len(next_link):
        yield self.make_requests_from_url(urljoin(response.url, next_link))

    posts = Selector(response).xpath('//div[@class="post_message"]')
    for post in posts:
        i = TextPostItem()
        i['url'] = post.xpath('a/@href').extract()

        yield i

-edit- So, based on eLRuLL's answer, I did this:

def parse(self, response):
    next_link = response.xpath("//a[contains(., '>')]//@href").extract()[0]
    if len(next_link):
        yield self.make_requests_from_url(urljoin(response.url, next_link))
    posts = Selector(response).xpath('//div[@class="post_message"]')
    for post in posts:
        i = TextPostItem()
        url = post.xpath('./a/@href').extract_first()
        i['new_url'] = urljoin(response.url, url)

        yield i

This worked, except that I now scrape a URL for every single post, even if the post didn't contain a link.

Upvotes: 0

Views: 900

Answers (1)

eLRuLL

Reputation: 18799

It looks like you need to add the domain at the beginning of that new URL. You could use response.url as the base and join the scraped link onto it, so something like:

from urlparse import urljoin  # Python 2; on Python 3 use: from urllib.parse import urljoin
...
url = post.xpath('./a/@href').extract_first()
new_url = urljoin(response.url, url) # someurl.com/leave.php?...
yield Request(new_url, ...)
...
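
If you would rather store the target URL than the leave.php redirect, and skip posts that have no link, something along these lines might work. It is only a sketch in Python 3 using the standard library: resolve_link is a hypothetical helper name, forum.example.com is a placeholder domain, and it assumes the target always sits percent-encoded in the u parameter, as in the example link from the question. Note that urljoin returns the base URL unchanged when the second argument is empty or None, which is probably why every post ends up with a URL.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def resolve_link(page_url, href, param="u"):
    """Return an absolute URL for href, or None when there is no link.

    For a redirect such as /leave.php?u=<encoded target>, the decoded
    target is returned instead of the redirect page itself.
    """
    if not href:
        # extract_first() gives None when the XPath matched nothing
        return None
    absolute = urljoin(page_url, href)
    parsed = urlparse(absolute)
    if parsed.path.endswith("/leave.php"):
        # parse_qs percent-decodes values, so %3A%2F%2F becomes ://
        targets = parse_qs(parsed.query).get(param)
        if targets:
            return targets[0]
    return absolute


# Placeholder page URL; the href is the example from the question
page = "http://forum.example.com/thread/1"
href = "/leave.php?u=http%3A%2F%2Fwww.lonestatistik.se%2Floner.asp%2Fyrke%2FUnderskoterska-1242"
print(resolve_link(page, href))
```

In the parse loop you would then call the helper with response.url and the result of extract_first(), and only fill in and yield the item when it returns something other than None.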

Upvotes: 1
