Reputation: 159
I need to create a list of website URLs. I use Scrapy 2.3.0 for this. The problem is that the result ('item_scraped_count') is 63 links, but I know there are more.
Is there any way to crawl deeper levels and pick up those URLs as well?
My code is below:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy import Item, Field

class UrlItem(Item):
    url = Field()

class RetriveUrl(CrawlSpider):
    name = 'retrive_url'
    allowed_domains = ['example.com']
    start_urls = ['https://www.example.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_url'),
    )

    def parse_url(self, response):
        item = UrlItem()
        item['url'] = response.url
        return item
Upvotes: 1
Views: 38
Reputation: 3740
You should allow the crawl to follow links through to the deeper levels. Try this:
    Rule(LinkExtractor(), callback='parse_url', follow=True),
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True, otherwise it defaults to False.
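Applied to the spider from the question, the rules definition becomes the following (a minimal sketch; only the follow=True flag is new):

    rules = (
        # With follow=True, the LinkExtractor is also run on every page
        # matched by this rule, so the crawl descends past the first level
        # instead of stopping after the links on the start page.
        Rule(LinkExtractor(), callback='parse_url', follow=True),
    )

Since this rule has a callback, follow defaults to False, which is why the crawl was stopping at 63 links: only the pages linked directly from the start URL were being visited.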
Upvotes: 3