Reputation: 11
I am working on my first Scrapy project, starting with a fairly simple website, StockX.
I would like to scrape the different categories of items. If I use the below URLs as my start_urls, how do I parse through each start URL?
https://stockx.com/sneakers
https://stockx.com/streetwear
https://stockx.com/collectibles
https://stockx.com/handbags
https://stockx.com/watches
I tried reading through the documentation on this topic but couldn't quite follow it.
I know the below isn't right, because I'm forcing a list of result URLs; I'm just not sure how the multiple start_urls should be processed in the first parse.
def parse(self, response):
    # obtain number of pages per product category
    text = list(map(lambda x: x.split('='),
                    response.xpath('//a[@class="PagingButton__PaginationButton-sc-1o2t560-0 eZnUxt"]/@href').extract()))
    total_pages = int(text[-1][-1])
    # compile a list of URLs for each result page
    cat = ['sneakers', 'streetwear', 'collectibles', 'handbags', 'watches']
    cat = ['https://stockx.com/{}'.format(x) for x in cat]
    lst = []
    for x in cat:
        for y in range(1, total_pages + 1):
            result_urls = lst.append(x + '?page={}'.format(y))
    for url in result_urls[7:9]:
        # print('Lets try: ', url)
        yield Request(url=url, callback=self.parse_results)
Upvotes: 1
Views: 5553
Reputation: 187
You can use a list comprehension in place of the initial start_urls list. For instance...
import scrapy

class myScraper(scrapy.Spider):
    name = 'movies'
    abc = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
    # format the URL inline: in Python 3, a separate class-level name
    # (e.g. url) would not be visible inside this comprehension
    start_urls = ['https://amazon.com/movies/{}'.format(x) for x in abc]
Note: Please don't run this; it was just for inspiration. I did something like this in a project a while back (and was too lazy to look it up again) and it worked. It saves you the time of having to create a custom start_requests function.
The URL I used does not exist; it's just an example of something you can do.
The main idea here is to use a list comprehension in place of the default start_urls list so that you don't have to write a fancy function.
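Applied to the question's StockX categories, that idea looks something like this (a minimal sketch; the spider name and the logging line are my own placeholders, not part of the original answer):

import scrapy

class StockxSpider(scrapy.Spider):
    name = 'stockx'
    categories = ['sneakers', 'streetwear', 'collectibles', 'handbags', 'watches']
    # build one start URL per category, formatted inline so the
    # comprehension doesn't depend on another class-level name
    start_urls = ['https://stockx.com/{}'.format(c) for c in categories]

    def parse(self, response):
        # parse() runs once per start URL; response.url identifies the category
        self.logger.info('Parsing %s', response.url)

Each response arrives in parse() independently, so you can branch on response.url if the categories need different handling.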
Upvotes: 0
Reputation: 8154
The simple solution is to use start_urls: https://doc.scrapy.org/en/1.4/topics/spiders.html#scrapy.spiders.Spider.start_urls
import scrapy

class MLBoddsSpider(scrapy.Spider):
    name = "stockx.com"
    allowed_domains = ["stockx.com"]
    start_urls = [
        "https://stockx.com/watches",
        "https://stockx.com/collectibles",
    ]

    def parse(self, response):
        # parse() is called once for every URL in start_urls
        ...
You can even control the start requests by overriding start_requests().
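For example, overriding start_requests() lets you attach the category name to each request (a sketch assuming Scrapy 1.7+, where cb_kwargs is available; on older versions, use request.meta instead):

import scrapy

class StockxSpider(scrapy.Spider):
    name = 'stockx'
    allowed_domains = ['stockx.com']

    def start_requests(self):
        for cat in ['sneakers', 'streetwear', 'collectibles', 'handbags', 'watches']:
            # pass the category along so the callback knows where it came from
            yield scrapy.Request('https://stockx.com/{}'.format(cat),
                                 callback=self.parse,
                                 cb_kwargs={'category': cat})

    def parse(self, response, category):
        self.logger.info('Got %s from %s', category, response.url)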
Upvotes: 2
Reputation: 638
Try something like this:

from scrapy import Spider, Request

class ctSpider(Spider):
    name = "stack"

    def start_requests(self):
        for d in URLS:  # URLS: your list of category URLs
            yield Request(d, callback=self.parse)
    ...
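URLS above is a placeholder; with the question's pages it could be defined as follows (my assumption, mirroring the URLs from the question):

URLS = [
    'https://stockx.com/sneakers',
    'https://stockx.com/streetwear',
    'https://stockx.com/collectibles',
    'https://stockx.com/handbags',
    'https://stockx.com/watches',
]

Each yielded Request then reaches self.parse as its own response.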
Upvotes: 2