Reputation: 97
I am trying to scrape a real estate website using Scrapy and PyCharm, and failing miserably.
Desired Results:
Note: I previously attempted this with BeautifulSoup, but it took around 1-2 minutes per listing, and around 2-3 hours to scrape all listings with a for loop. A community member suggested Scrapy as a faster option. I'm unsure whether its speed comes from the data pipelines or whether I can do multi-threading with it.
Any and all help is greatly appreciated. ^^
Website sample HTML snippet: This is a picture of the HTML I am trying to scrape.
Current Scrapy Code: This is what I have so far. When I run scrapy crawl unegui_apts I cannot seem to get the results I want. I'm so lost.
# -*- coding: utf-8 -*-
# Import libraries
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request


# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}}}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/1-r/,'
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/2-r/'
    ]
    headers = {
        'user-agent': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"
    }

    def parse(self, response):
        self.logger.debug('callback "parse": got response %r' % response)
        cards = response.xpath('//div[@class="list-announcement-block"]')
        for card in cards:
            name = card.xpath('.//meta[@itemprop="name"]/text()').extract_first()
            price = card.xpath('.//meta[@itemprop="price"]/text()').extract_first()
            city = card.xpath('.//meta[@itemprop="areaServed"]/text()').extract_first()
            date = card.xpath('.//*[@class="announcement-block__date"]/text()').extract_first().strip().split(', ')[0]
            request = Request(link, callback=self.parse_details, meta={'name': name,
                                                                       'price': price,
                                                                       'city': city,
                                                                       'date': date})
            yield request
        next_url = response.xpath('//li[@class="pager-next"]/a/@href').get()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()
Upvotes: 0
Views: 1024
Reputation: 2110
Your code has a number of issues:

- The start_urls list contains invalid links.
- You define a user_agent string in the headers dictionary but you are not using it when yielding requests.
- Your next_url xpath is incorrect, hence it does not yield new requests to the next pages.

I have updated your code to fix the issues above as follows:
import scrapy
from scrapy.crawler import CrawlerProcess


# Create Spider class
class UneguiApartments(scrapy.Spider):
    name = 'unegui_apts'
    allowed_domains = ['www.unegui.mn']
    custom_settings = {'FEEDS': {'results1.csv': {'format': 'csv'}},
                       'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36"}
    start_urls = [
        'https://www.unegui.mn/l-hdlh/l-hdlh-zarna/oron-suuts-zarna/'
    ]

    def parse(self, response):
        cards = response.xpath(
            '//li[contains(@class,"announcement-container")]')
        for card in cards:
            name = card.xpath(".//a[@itemprop='name']/@content").extract_first()
            price = card.xpath(".//*[@itemprop='price']/@content").extract_first()
            date = card.xpath("normalize-space(.//div[contains(@class,'announcement-block__date')]/text())").extract_first()
            city = card.xpath(".//*[@itemprop='areaServed']/@content").extract_first()
            yield {'name': name,
                   'price': price,
                   'city': city,
                   'date': date}
        next_url = response.xpath("//a[contains(@class,'red')]/parent::li/following-sibling::li/a/@href").extract_first()
        if next_url:
            # go to next page until no more pages
            yield response.follow(next_url, callback=self.parse)


# main driver
if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(UneguiApartments)
    process.start()
You run the above spider by executing python <filename.py>, since you are running a standalone script and not a full-blown Scrapy project.
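On the speed question in your post: Scrapy's throughput comes from its asynchronous engine issuing many requests concurrently, not from pipelines or manual multi-threading. Concurrency is tunable through settings; the values in this sketch are illustrative assumptions (they happen to match Scrapy's defaults), not requirements:

# Illustrative settings sketch: Scrapy schedules requests asynchronously,
# so throughput is governed by these knobs rather than by threading.
custom_settings = {
    'FEEDS': {'results1.csv': {'format': 'csv'}},
    'CONCURRENT_REQUESTS': 16,            # total parallel requests (Scrapy default: 16)
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,  # per-domain cap (Scrapy default: 8)
    'DOWNLOAD_DELAY': 0.25,               # polite delay in seconds between requests
}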
Sample CSV results are shown in the image below. You will need to clean up the data using pipelines and the Scrapy Item class; see the docs for more details.
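For illustration, a minimal sketch of that cleanup step could look like the following. The Item fields, the class names, and the whitespace-stripping rule are assumptions for this example, not something the site or the answer above dictates; registering the pipeline under '__main__' works here because the spider runs as a standalone script:

import scrapy


# Hypothetical Item declaring the four fields the spider yields;
# yielding ApartmentItem(...) instead of a plain dict adds field validation.
class ApartmentItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    city = scrapy.Field()
    date = scrapy.Field()


# Hypothetical pipeline: strips surrounding whitespace from every string value.
# Enable it in the spider with:
#   custom_settings = {
#       'FEEDS': {'results1.csv': {'format': 'csv'}},
#       'ITEM_PIPELINES': {'__main__.StripWhitespacePipeline': 300},
#   }
class StripWhitespacePipeline:
    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                item[key] = value.strip()
        return item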
Upvotes: 1