Reputation: 153
Having the following spider:
import scrapy
from final.items import FinalItem

class ScrapeMovies(scrapy.Spider):
    name = 'final'
    start_urls = [
        'https://www.trekearth.com/members/page1.htm?sort_by=md'
    ]

    def parse(self, response):
        for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
            item = FinalItem()
            item['name'] = row.xpath('./td[2]//a/text()').extract_first()
            website = row.xpath('./td[2]//a/@href/text()').extract_first()
            request = scrapy.Request(website,
                                     callback=self.parse_page2)
            yield request

    def parse_page2(self, response):
        request.meta['item'] = item
        item['travelog'] = response.xpath('string(//div[@class="statistics-btm"]/ul//li[position()=4]/a)').extract_first()
        yield item

        # next_page = response.xpath('//div[@class="page-nav-btm"]/ul/li[last()]/a/@href').extract_first()
        # if next_page is not None:
        #     next_page = response.urljoin(next_page)
        #     yield scrapy.Request(next_page, callback=self.parse)
I have a table. I want to scrape the name (and other information as well) from each row, then follow the link to the user's profile, gather some data there, and merge it all into a single item.
Then I want to return to the main table and go to its next page, and so on until the end (the final part of the code is responsible for that; it is commented out for convenience).
The code I wrote does not work properly. The error I get is:
TypeError: Request url must be str or unicode, got NoneType:
How do I fix this? How do I make it properly crawl all of the data?
Upvotes: 0
Views: 50
Reputation: 10666
You need this code. Your XPath expressions are wrong: in particular, `./td[2]//a/@href/text()` asks for the text of an attribute node, which never exists, so `extract_first()` returns `None` and `scrapy.Request(None)` raises the `TypeError` you see. Select `@href` directly instead:
def parse(self, response):
    for row in response.xpath('//table[@class="member-table"]//tr[position() > 1]'):
        item = FinalItem()
        item['name'] = row.xpath('./td[2]//a/text()').extract_first()
        profile_url = row.xpath('./td[2]//a/@href').extract_first()
        yield scrapy.Request(url=response.urljoin(profile_url), callback=self.parse_profile, meta={'item': item})

    next_page_url = response.xpath('//div[@class="page-nav-btm"]//li[last()]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

def parse_profile(self, response):
    item = response.meta['item']
    item['travelog'] = response.xpath('//div[@class="statistics-btm"]/ul//li[ ./span[contains(., "Travelogues")] ]/a/text()').extract_first()
    yield item
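A side note on `response.urljoin`: the hrefs scraped from the listing page may be relative, and it resolves them against the current page URL before they are passed to `scrapy.Request`. It behaves like the standard library's `urllib.parse.urljoin`; the member path below is a made-up example, not a real profile:

```python
from urllib.parse import urljoin

# The page the spider is currently parsing.
base = 'https://www.trekearth.com/members/page1.htm?sort_by=md'

# An absolute-path href replaces the whole path (query string dropped).
print(urljoin(base, '/members/someuser/'))
# → https://www.trekearth.com/members/someuser/

# A relative href replaces only the last path segment.
print(urljoin(base, 'page2.htm?sort_by=md'))
# → https://www.trekearth.com/members/page2.htm?sort_by=md
```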
Upvotes: 1