WX1505

Reputation: 97

How to efficiently crawl a website using Scrapy

I am trying to scrape a real estate website using Scrapy and PyCharm, and failing miserably.

Desired Results:

Note: I have attempted scraping this site with BeautifulSoup, but it took around 1-2 minutes per listing and around 2-3 hours to scrape all listings with a for loop. A community member referred me to Scrapy as the faster option, but I'm unsure whether that is because of its data pipelines or because it can do multi-threading.

Any and all help is greatly appreciated. ^^

Website sample HTML snippet: This is a picture of the HTML I am trying to scrape.


Current Scrapy Code: This is what I have so far. When I run scrapy crawl unegui_apts, I cannot seem to get the results I want. I'm so lost.

# -*- coding: utf-8 -*-

# Import library
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/,'
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/'
        ]
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
    }

    def parse(self, response):
        self.logger.debug('callback "parse": got response %r' % response)
        cards = response.xpath('//div[@class="list-announcement-block"]')
        for card in cards:
            name = card.xpath('.//meta[@itemprop="name"]/text()').extract_first()
            price = card.xpath('.//meta[@itemprop="price"]/text()').extract_first()
            city = card.xpath('.//meta[@itemprop="areaServed"]/text()').extract_first()
            date = card.xpath('.//*[@class="announcement-block__date"]/text()').extract_first().strip().split(', ')[0]

            request = Request(link, callback=self.parse_details, meta={'name': name,
                                                                       'price': price,
                                                                       'city': city,
                                                                       'date': date})
            yield request

        next_url = response.xpath('//li[@class="pager-next"]/a/@href').get()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)

    # main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()

Upvotes: 0

Views: 1024

Answers (1)

msenior_

Reputation: 2110

Your code has a number of issues:

  1. The start_urls list contains invalid links
  2. You defined your user_agent string in the headers dictionary but you are not using it when yielding requests
  3. Your xpath selectors are incorrect
  4. The next_url selector is incorrect, so it does not yield new requests for the next pages

I have updated your code to fix the issues above as follows:

import scrapy
from scrapy.crawler import CrawlerProcess

# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}},
                       'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'
    ]

    def parse(self, response):
        cards = response.xpath(
            '//li[contains(@class,"announcement-container")]')
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
            date = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first()
            city = card.xpath(".//*[@itemprop='areaServed']/@content").extract_first()

            yield {'name': name,
                   'price': price,
                   'city': city,
                   'date': date}

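        # the pager entry whose link carries the 'red' class is the current page;
        # take the href from the <li> that immediately follows it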
        next_url = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()

You run the above spider by executing the command python <filename.py>, since you are running a standalone script and not a full-blown project.
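On the speed concern in the question: Scrapy's advantage over a sequential BeautifulSoup loop comes from its asynchronous downloader sending many requests concurrently, not from multi-threading, and that concurrency is configurable through settings. A minimal sketch of the relevant knobs, added to the custom_settings dictionary above (the values here are only illustrative, not recommendations):

custom_settings = {
    'FEEDS': {'results1.csv': {'format': 'csv'}},
    'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
    # concurrency settings: trade crawl speed for politeness to the site
    'CONCURRENT_REQUESTS': 16,            # Scrapy's default is already 16
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,  # default is 8
    'DOWNLOAD_DELAY': 0.25,               # seconds to wait between requests to the same domain
}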

Sample csv results are as shown in the image below. You will need to clean up the data using pipelines and the scrapy item class. See the docs for more details.

Sample csv of crawled results
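As a rough illustration of that clean-up step, here is a minimal sketch using an item class and a pipeline; the ApartmentItem fields and the digits-only price rule are assumptions made up for this example, not something the site dictates:

import scrapy
from itemadapter import ItemAdapter


class ApartmentItem(scrapy.Item):
    # one Field per column you want in the CSV
    name = scrapy.Field()
    price = scrapy.Field()
    city = scrapy.Field()
    date = scrapy.Field()


class CleanPricePipeline:
    # strip everything but digits from the price so the CSV holds plain integers
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        price = adapter.get('price')
        if price is not None:
            digits = ''.join(ch for ch in str(price) if ch.isdigit())
            adapter['price'] = int(digits) if digits else None
        return item

In the spider you would then yield ApartmentItem(name=name, price=price, city=city, date=date) instead of a plain dict, and enable the pipeline by adding 'ITEM_PIPELINES': {'__main__.CleanPricePipeline': 300} to custom_settings (the __main__ path is what a standalone script like this one would use).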

Upvotes: 1
