Reputation: 25
I am trying to scrape a site with div elements. For each div element I want to scrape some data, then follow its child link and scrape more data from that page.
Here is the code of quote.py
import scrapy
from ..items import QuotesItem


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')
        for quote in all_div_quotes:
            item = QuotesItem()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()
            item['title'] = title
            item['author'] = author
            item['tags'] = tags
            request = scrapy.Request(author_details_url,
                                     callback=self.author_born,
                                     meta={'item': item, 'next_url': author_details_url})
            yield request

    def author_born(self, response):
        item = response.meta['item']
        next_url = response.meta['next_url']
        author_born = response.css('.author-born-date::text').extract()
        item['author_born'] = author_born
        yield scrapy.Request(next_url, callback=self.author_birthplace,
                             meta={'item': item})

    def author_birthplace(self, response):
        item = response.meta['item']
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_birthplace'] = author_birthplace
        yield item
Here is the code of items.py
import scrapy


class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    author_born = scrapy.Field()
    author_birthplace = scrapy.Field()
I ran the command scrapy crawl quote -o data.json. There was no error message, but data.json was empty. I was expecting to get all the data in its corresponding fields.
Can you please help me?
Upvotes: 1
Views: 299
Reputation: 1445
Take a closer look at your logs and you'll find messages like this:
DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein>
Scrapy automatically filters duplicate requests so that it doesn't visit the same URL twice (for obvious reasons). In your case you can add dont_filter=True
to your requests and you will see something like this:
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/author/Steve-Martin/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/author/Albert-Einstein/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/author/Marilyn-Monroe/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/author/J-K-Rowling/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/author/Eleanor-Roosevelt/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/author/Andre-Gide/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/author/Thomas-A-Edison/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/author/Jane-Austen/)
Which is indeed a bit strange, because each author page ends up yielding a request to itself.
Overall you could end up with something like this:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')
        for quote in all_div_quotes:
            item = dict()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author+ a::attr(href)').extract_first()
            item['title'] = title
            item['author'] = author
            item['tags'] = tags
            print(item)
            # dont_filter=True in case we get two quotes by the same author.
            # This is not optimal though. A better approach would be to save author
            # data to self.storage and only visit author info pages we haven't seen,
            # otherwise take the info from the saved dict.
            request = scrapy.Request(author_details_url,
                                     callback=self.author_info,
                                     meta={'item': item},
                                     dont_filter=True)
            yield request

    def author_info(self, response):
        item = response.meta['item']
        author_born = response.css('.author-born-date::text').extract()
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_born'] = author_born
        item['author_birthplace'] = author_birthplace
        yield item
Upvotes: 1