Reputation: 31
(Very) new to Python and programming in general.
I've been trying to scrape data from several pages/sections of the same website with Scrapy.
My code works, but it's unreadable and not practical:
import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['https://example.com']
    start_urls = [
        'https://example.com/Python/?k=books&p=1',
        'https://example.com/Python/?k=books&p=2',
        'https://example.com/Python/?k=books&p=3',
        'https://example.com/Python/?k=tutorials&p=1',
        'https://example.com/Python/?k=tutorials&p=2',
        'https://example.com/Python/?k=tutorials&p=3',
    ]

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }
            yield scraped_info
How can I improve this?
I'd like to search within a certain number of categories and pages. I need something like

categories = ["books", "tutorials", "a", "b", "c", "d", "e", "f"]

combined with pages in a range(1, 3), so that Scrapy could do its job across all categories and pages, while staying easy to edit and adapt to other websites.
Any ideas are welcome.
What I have tried:
import itertools

categories = ["books", "tutorials"]
base = "https://example.com/Python/?k={category}&p={index}"

def url_generator():
    for category, index in itertools.product(categories, range(1, 4)):
        yield base.format(category=category, index=index)
But Scrapy returns:

[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
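(As posted, nothing ever hands the generated URLs to the spider, which would explain the empty crawl. A minimal sketch of one possible wiring, reusing the same categories and base, is to materialise the generator into start_urls; everything here besides the generator itself is an assumption about how it was meant to be used.)

import itertools
import scrapy

categories = ["books", "tutorials"]
base = "https://example.com/Python/?k={category}&p={index}"

def url_generator():
    for category, index in itertools.product(categories, range(1, 4)):
        yield base.format(category=category, index=index)

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['example.com']
    # materialise the generator so Scrapy actually schedules the URLs
    start_urls = list(url_generator())

    def parse(self, response):
        # same parsing as in the original spider
        ...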
Upvotes: 0
Views: 541
Reputation: 31
Solved thanks to start_requests() and yield scrapy.Request(). Here's the code:
import scrapy
import itertools

class SomeSpider(scrapy.Spider):
    name = 'somespider'
    allowed_domains = ['example.com']

    def start_requests(self):
        categories = ["books", "tutorials"]
        base = "https://example.com/Python/?k={category}&p={index}"
        for category, index in itertools.product(categories, range(1, 4)):
            yield scrapy.Request(base.format(category=category, index=index))

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }
            yield scraped_info
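.extract() still works, but newer Scrapy versions prefer .getall() and .get(); as a minimal variation (same placeholder selectors as above), the parse() method could equivalently be written as:

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").getall()
        info2 = response.css("scrapedinfo2").getall()

        for value1, value2 in zip(info1, info2):
            yield {
                'scrapedinfo1': value1,
                'scrapedinfo2': value2,
            }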
Upvotes: 1
Reputation: 142631
You can use the method start_requests() to generate URLs at startup, using yield Request(url).
BTW: later, in parse(), you can also use yield Request(url) to add new URLs.
I use the portal toscrape.com, which was created for testing spiders.
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['quotes.toscrape.com']
    #start_urls = []

    tags = ['love', 'inspirational', 'life', 'humor', 'books', 'reading']
    pages = 3
    url_template = 'http://quotes.toscrape.com/tag/{}/page/{}'

    def start_requests(self):
        for tag in self.tags:
            for page in range(1, self.pages + 1):
                url = self.url_template.format(tag, page)
                yield scrapy.Request(url)

    def parse(self, response):
        # test if method was executed
        print('url:', response.url)

# --- run it without project ---

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
#    'FEED_FORMAT': 'csv',
#    'FEED_URI': 'output.csv',
#})

c = CrawlerProcess()
c.crawl(MySpider)
c.start()
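To illustrate the BTW above, here is a minimal sketch of a parse() that also queues new requests by following the standard "next page" link on quotes.toscrape.com (the li.next a selector comes from that site's pagination markup; the spider name is made up for this sketch):

import scrapy

class QuotesFollowSpider(scrapy.Spider):
    name = 'quotes_follow'  # hypothetical name for this sketch
    start_urls = ['http://quotes.toscrape.com/tag/love/page/1']

    def parse(self, response):
        print('url:', response.url)

        # queue the next page of the same tag, if there is one
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)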
Upvotes: 0