Rawhide

Reputation: 31

How to scrape multiple pages from a website?

(Very) new to Python and programming in general.

I've been trying to scrape data from multiple pages/sections of the same website with Scrapy.

My code works, but it's unreadable and not practical:

import scrapy

class SomeSpider(scrapy.Spider):
    name = 'some'
    allowed_domains = ['https://example.com']
    start_urls = [
        'https://example.com/Python/?k=books&p=1',
        'https://example.com/Python/?k=books&p=2',
        'https://example.com/Python/?k=books&p=3',
        'https://example.com/Python/?k=tutorials&p=1',
        'https://example.com/Python/?k=tutorials&p=2',
        'https://example.com/Python/?k=tutorials&p=3',
    ]

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }

            yield scraped_info

How can I improve this?

I'd like to search within a certain number of categories and pages

I need something like

categories = [books, tutorials, a, b, c, d, e, f] 
in a range(1,3)

So that Scrapy can work through all the categories and pages, while staying easy to edit and adapt to other websites

Any ideas are welcome

What I have tried:

import itertools

categories = ["books", "tutorials"]
base = "https://example.com/Python/?k={category}&p={index}"

def url_generator():
    for category, index in itertools.product(categories, range(1, 4)):
        yield base.format(category=category, index=index)

But Scrapy returns

[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), 
scraped 0 items (at 0 items/min)

Upvotes: 0

Views: 541

Answers (2)

Rawhide

Reputation: 31

Solved, thanks to start_requests() and yield scrapy.Request().

Here's the code:

import scrapy
import itertools


class SomeSpider(scrapy.Spider):
    name = 'somespider'
    allowed_domains = ['example.com']

    def start_requests(self):
        categories = ["books", "tutorials"]
        base = "https://example.com/Python/?k={category}&p={index}"

        for category, index in itertools.product(categories, range(1, 4)):
            yield scrapy.Request(base.format(category=category, index=index))

    def parse(self, response):
        response.selector.remove_namespaces()

        info1 = response.css("scrapedinfo1").extract()
        info2 = response.css("scrapedinfo2").extract()

        for item in zip(info1, info2):
            scraped_info = {
                'scrapedinfo1': item[0],
                'scrapedinfo2': item[1],
            }

            yield scraped_info
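
Inside a standard Scrapy project this runs with the usual command, e.g. scrapy crawl somespider -o output.csv to export the yielded items.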

Upvotes: 1

furas

Reputation: 142631

You can use the start_requests() method to generate the URLs at start, using yield Request(url).

BTW: later, in parse(), you can also use yield Request(url) to add new URLs.
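
For example, parse() can pick up the "next page" link from the response and yield a new Request for it. This is only a sketch: the spider name and start URL are made up, and the li.next selector assumes the quotes.toscrape.com markup.

import scrapy

class FollowSpider(scrapy.Spider):
    # hypothetical spider, only to show yielding a Request from inside parse()
    name = 'followspider'
    start_urls = ['http://quotes.toscrape.com/tag/books/']

    def parse(self, response):
        # test if method was executed
        print('url:', response.url)

        # the "next page" link sits in <li class="next"> on quotes.toscrape.com
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            # schedule one more request; Scrapy will call parse() for it as well
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)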

I use the portal toscrape.com, which was created for testing spiders.

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['quotes.toscrape.com']

    #start_urls = []

    tags = ['love', 'inspirational', 'life', 'humor', 'books', 'reading']
    pages = 3
    url_template = 'http://quotes.toscrape.com/tag/{}/page/{}'

    def start_requests(self):

        for tag in self.tags:
            for page in range(1, self.pages + 1):
                url = self.url_template.format(tag, page)
                yield scrapy.Request(url)


    def parse(self, response):
        # test if method was executed
        print('url:', response.url)

# --- run it without project ---

from scrapy.crawler import CrawlerProcess

#c = CrawlerProcess({
#    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
#    'FEED_FORMAT': 'csv',
#    'FEED_URI': 'output.csv',
#})

c = CrawlerProcess()
c.crawl(MySpider)
c.start()
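
Saved as a single file, this can be started with a plain python command; no Scrapy project is needed. The commented-out settings show where a custom user agent and CSV feed export (FEED_FORMAT/FEED_URI) would go once parse() actually yields items.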

Upvotes: 0
