Mrowkacala

Reputation: 153

Following crawled links

Having the following spider:

import scrapy
from final.items import FinalItem

class ScrapeMovies(scrapy.Spider):
    name='final'

    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):

            item = FinalItem()

            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            request = scrapy.Request(website,
            callback=self.parse_page2)
            yield request

    def parse_page2(self, response):
            request.meta['item'] = item
            item['travelog'] = response.xpath('string(//div[@class="statistics-btm"]/ul//li[position()=4]/a)').extract_first()
            yield item

#       next_page=response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
#       if next_page is not None:
#            next_page=response.urljoin(next_page)
#            yield scrapy.Request(next_page, callback=self.parse)

I have a table. I want to scrape the name (and other information) from each row of this table, then follow the link to each user's profile, gather some data from the profile, and merge everything into a single item.

Then I want to return to the main table and advance to its next page, and so on until the end (the final part of the code is responsible for that; it is commented out for convenience).

The code I wrote does not work properly. The error I get is:

TypeError: Request url must be str or unicode, got NoneType:

How do I fix this? How do I make it crawl all of the data properly?

Upvotes: 0

Views: 50

Answers (1)

gangabass

Reputation: 10666

You need the code below (your XPath expressions are wrong). `@href` already selects the attribute's string value, so appending `/text()` to it matches nothing, `extract_first()` returns `None`, and `scrapy.Request(None)` raises the `TypeError` you see. Your `parse_page2` also references `request` and `item`, which are undefined in that method; pass the item between callbacks via the request's `meta` instead:

def parse(self, response):
    for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
        item = FinalItem()
        item['name'] = row.xpath('./td[2]//a/text()').extract_first()
        profile_url = row.xpath('./td[2]//a/@href').extract_first()
        yield scrapy.Request(url=response.urljoin(profile_url),
                             callback=self.parse_profile,
                             meta={'item': item})

    next_page_url = response.xpath('//div[@class="page-nav-btm"]//li[last()]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

def parse_profile(self, response):
    item = response.meta['item']
    item['travelog'] = response.xpath('//div[@class="statistics-btm"]/ul//li[ ./span[contains(., "Travelogues")] ]/a/text()').extract_first()
    yield item
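You can verify both fixes outside of Scrapy. The sketch below (using `lxml`, the XPath engine underneath Scrapy's selectors, on a hypothetical one-row table) shows that `@href/text()` selects nothing while `@href` returns the link, and that `urljoin` (which `response.urljoin` wraps) resolves the relative profile URL against the page URL:

```python
from urllib.parse import urljoin
from lxml import html

# Hypothetical minimal row mimicking the member-table markup.
ROW = ('<table class="member-table">'
       '<tr><td>1</td><td><a href="/members/foo/">foo</a></td></tr>'
       '</table>')
doc = html.fromstring(ROW)

# An attribute node has no text() children, so this matches nothing.
broken = doc.xpath('//td[2]//a/@href/text()')   # -> []
# @href itself is the string value you want.
working = doc.xpath('//td[2]//a/@href')         # -> ['/members/foo/']

# response.urljoin does the same as urllib.parse.urljoin with the page URL.
base = 'https://www.trekearth.com/members/page1.htm?sort_by=md'
absolute = urljoin(base, working[0])            # -> 'https://www.trekearth.com/members/foo/'
```

An empty result list is why `extract_first()` returned `None` in the original spider, which then reached `scrapy.Request` as the URL.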

Upvotes: 1
