I'm new to Scrapy and Python. I want to do the following:

1. Access a URL and get all the links containing "shop/products" as part of the URL. The links look like: "http://www.example.com/shop/products/category-name"
2. Scrape a URL from start_urls and get the total number of products, TOTAL. In the code, TOTAL = num_items_per_category.
3. At the end, append "?sort=Top&size=12&start=PARAM" to the URL. PARAM must be incremented by 12 on each iteration as long as PARAM <= TOTAL, so the final URL would be "http://www.example.com/shop/products/category-name?sort=Top&size=12&start=PARAM" (see the sketch just after this list).
4. Get another URL from the generated start_urls and start step 2 again.
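
To make step 3 concrete, this is the sequence of URLs I expect to generate for one category (TOTAL = 40 is just a made-up number for illustration):

    # Made-up example: a category with TOTAL = 40 products and a page size
    # of 12, so PARAM should take the values 0, 12, 24 and 36.
    base = 'http://www.example.com/shop/products/category-name'
    TOTAL = 40
    param = 0
    while param <= TOTAL:
        print('%s?sort=Top&size=12&start=%s' % (base, param))
        param += 12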
Here is my spider code:
import scrapy
import re
import datetime
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http.request import Request

class MySpider(CrawlSpider):
    name = 'my_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/shop/products']

    rules = (
        Rule(LxmlLinkExtractor(
            restrict_xpaths=('.//li[@class="item"]/a')),
            follow=False,
            callback='parse_list'
        ),
    )

    def parse_list(self, response):
        # Yield one item per product on the page.
        ITEM_SELECTOR = '.product'
        for item in response.css(ITEM_SELECTOR):
            NAME_SELECTOR = 'div[@class="product"]/h2/a/@title'
            yield {
                'name': item.xpath(NAME_SELECTOR).extract_first()
            }

        # Total number of products in the category (TOTAL).
        NUM_ITEMS_PER_CATEGORY_SELECTOR = 'div[@id="search"]/@data-count'
        num_items_per_category = item.xpath(NUM_ITEMS_PER_CATEGORY_SELECTOR).extract_first()
        nipc = int(0 if num_items_per_category is None else num_items_per_category)

        # Paginate: bump the start offset by 12 until it passes the total.
        try:
            next_start = response.meta["next_start"]
        except KeyError:
            next_start = 0
        if next_start <= nipc:
            yield scrapy.Request(
                response.urljoin('%s?sort=Top&size=12&start=%s' % (response.url, next_start)),
                meta={"next_start": next_start + 12},
                dont_filter=True,
                callback=self.parse_list
            )
Problems are:
1. I don't know whether there is any CSS selector or regex I can use in a Rule to select every link I want. In the code I'm restricting extraction to a path where I know some of the links I want are, but there are still more of them elsewhere on the page.
2. The code is not working as I expect. It seems next_start is not being incremented by 12 on each iteration; the code only gets the first 12 elements of each URL in the generated list. Am I using the meta variables correctly? Or maybe I need a first scrape of each category page to get the TOTAL count before I can iterate over it? Or maybe I need a different approach using start_requests... What do you think?
What your spider currently does is visit the URL http://www.example.com/shop/products, extract all the links inside <li class="item"> elements, and fetch all of them with the parse_list callback. As I see it, that is not the behavior you're expecting - instead, you should use start URLs that merely seed the crawl, plus a LinkExtractor with allow=r"shop/products" in the Rule.
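
For example, a minimal sketch of what I mean, reusing your imports and class (the regex is something you'd adjust to your real URLs):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

    class MySpider(CrawlSpider):
        name = 'my_spider'
        allowed_domains = ['example.com']
        # Seed URL only; the Rule decides which links get followed.
        start_urls = ['http://www.example.com/shop/products']

        rules = (
            # Match links by URL pattern instead of by page location.
            Rule(LxmlLinkExtractor(allow=r"shop/products"),
                 follow=False,
                 callback='parse_list'),
        )

        def parse_list(self, response):
            ...  # category parsing as in your spider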
Also, this part '%s?sort=Top&size=12&start=%s' % (response.url, next_start) is wrong, because response.url contains the full URL including the GET parameters, so every time you append the parameter string to the existing parameters you get something like ?sort=Top&size=12&start=0?sort=Top&size=12&start=12?sort=Top&size=12&start=24. Clean the parameters off the URL before appending the new string, or just use FormRequest as a more convenient way to pass parameters.
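
For example, a sketch of both options inside your parse_list (next_start is assumed to be computed as in your code; FormRequest with method='GET' urlencodes the formdata dict and appends it to the URL):

    from urllib.parse import urlsplit, urlunsplit
    from scrapy.http import FormRequest

    # Option 1: strip the old query string, then append fresh parameters.
    parts = urlsplit(response.url)
    clean_url = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
    next_url = '%s?sort=Top&size=12&start=%s' % (clean_url, next_start)

    # Option 2: let FormRequest build the query string from a dict.
    yield FormRequest(clean_url,
                      method='GET',
                      formdata={'sort': 'Top',
                                'size': '12',
                                'start': str(next_start)},
                      meta={'next_start': next_start + 12},
                      dont_filter=True,
                      callback=self.parse_list)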
By the way, Scrapy has a very handy interactive console for debugging purposes, which you can invoke from any part of a spider using scrapy.shell.inspect_response.
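
For example, a call placed anywhere in parse_list pauses the crawl and opens a shell with the current response loaded, so you can try your selectors live:

    from scrapy.shell import inspect_response

    def parse_list(self, response):
        # Drops into an interactive shell at this point; `response` and
        # the spider are available, so you can test response.css(...) here.
        inspect_response(response, self)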