WX1505

Reputation: 97

How to efficiently crawl a website using Scrapy

I am trying to scrape a real estate website using Scrapy and PyCharm, and failing miserably.

Desired Results:

Note: I have attempted scraping this site with BeautifulSoup, but it took around 1-2 minutes per listing and around 2-3 hours to scrape all listings with a for loop. A community member referred me to Scrapy as the faster option, but I'm unsure whether that is because of its data pipelines or because it can do multi-threading.

Any and all help is greatly appreciated. ^^

Website sample HTML snippet: This is a picture of the HTML I am trying to scrape.


Current Scrapy Code: This is what I have so far. When I run scrapy crawl unegui_apts, I cannot seem to get the results I want. I'm so lost.

# -*- coding: utf-8 -*-

# Import library
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/,'
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/'
        ]
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
    }

    def parse(self, response):
        self.logger.debug('callback "parse": got response %r' % response)
        cards = response.xpath('//div[@class="list-announcement-block"]')
        for card in cards:
            name = card.xpath('.//meta[@itemprop="name"]/text()').extract_first()
            price = card.xpath('.//meta[@itemprop="price"]/text()').extract_first()
            city = card.xpath('.//meta[@itemprop="areaServed"]/text()').extract_first()
            date = card.xpath('.//*[@class="announcement-block__date"]/text()').extract_first().strip().split(', ')[0]

            request = Request(link, callback=self.parse_details, meta={'name': name,
                                                                       'price': price,
                                                                       'city': city,
                                                                       'date': date})
            yield request

        next_url = response.xpath('//li[@class="pager-next"]/a/@href').get()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)

    # main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()

Upvotes: 0

Views: 1024

Answers (1)

msenior_

Reputation: 2110

Your code has a number of issues:

  1. The start_urls list contains invalid links
  2. You defined your user_agent string in the headers dictionary but you are not using it when yielding requests
  3. Your xpath selectors are incorrect
  4. The next_url selector is incorrect, so it does not yield new requests for the next pages

I have updated your code to fix the issues above as follows:

import scrapy
from scrapy.crawler import CrawlerProcess

# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}},
                       'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'
    ]

    def parse(self, response):
        cards = response.xpath(
            '//li[contains(@class,"announcement-container")]')
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
            date = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first()
            city = card.xpath(".//*[@itemprop='areaServed']/@content").extract_first()

            yield {'name': name,
                   'price': price,
                   'city': city,
                   'date': date}

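        # the pager entry whose link carries the 'red' class is the current page;
        # take the href from the <li> that immediately follows it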
        next_url = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()

You run the above spider by executing the command python <filename.py>, since you are running a standalone script and not a full-blown project.
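On the speed concern in the question: Scrapy's advantage over a sequential BeautifulSoup loop comes from its asynchronous downloader sending many requests concurrently, not from multi-threading, and that concurrency is configurable through settings. A minimal sketch of the relevant knobs, added to the custom_settings dictionary above (the values here are only illustrative, not recommendations):

custom_settings = {
    'FEEDS': {'results1.csv': {'format': 'csv'}},
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    # concurrency settings: trade crawl speed for politeness to the site
    'CONCURRENT_REQUESTS': 16,            # Scrapy's default is already 16
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,  # default is 8
    'DOWNLOAD_DELAY': 0.25,               # seconds to wait between requests to the same domain
}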

Sample csv results are as shown in the image below. You will need to clean up the data using pipelines and the scrapy item class. See the docs for more details.

Sample csv of crawled results
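As a rough illustration of that clean-up step, here is a minimal sketch using an item class and a pipeline; the ApartmentItem fields and the digits-only price rule are assumptions made up for this example, not something the site dictates:

import scrapy
from itemadapter import ItemAdapter


class ApartmentItem(scrapy.Item):
    # one Field per column you want in the CSV
    name = scrapy.Field()
    price = scrapy.Field()
    city = scrapy.Field()
    date = scrapy.Field()


class CleanPricePipeline:
    # strip everything but digits from the price so the CSV holds plain integers
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price = adapter.get('price')
        if price is not None:
            digits = ''.join(ch for ch in str(price) if ch.isdigit())
            adapter['price'] = int(digits) if digits else None
        return item

In the spider you would then yield ApartmentItem(name=name, price=price, city=city, date=date) instead of a plain dict, and enable the pipeline by adding 'ITEM_PIPELINES': {'__main__.CleanPricePipeline': 300} to custom_settings (the __main__ path is what a standalone script like this one would use).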

Upvotes: 1
