Licey Soremap

Reputation: 1

How can I navigate and extract restaurant details from zaubee.com when the href attribute is set to "#" for each restaurant link?

How can I use Scrapy to scrape zaubee.com and extract business details from each restaurant's page when the href attribute of every restaurant link is set to "#"?

I'm working on a web scraping project that gathers business information from zaubee.com. However, the href attribute I extract for each restaurant link is "#", so I cannot follow the links to the individual restaurant pages and collect the data I need.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    allowed_domains = ['www.zaubee.com']
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']

    def parse(self, response):
        restaurantlink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for restaurant in restaurantlink:
            name = restaurant.xpath(".//text()").get()
            link = restaurant.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()

        yield {
            'name': name,
            'website': website,
            'address': address
        }

I've previously created a scraping solution using Scrapy, but I need help overcoming this challenge. What method or workaround can I use to visit each restaurant's page and get the necessary information?

OUTPUT FOR ONE ENTRY:

2023-06-04 23:38:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': 'Restaurants in Fredonia New York', 'link': '#'}

When it tries to follow the link, the output is:

2023-06-04 23:38:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': None, 'website': None, 'address': None}

I'm trying to follow each restaurant link and collect the restaurant's name, address, telephone number, and opening hours from that page.

Upvotes: 0

Views: 102

Answers (1)

SuperUser

Reputation: 4822

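The restaurant links are on the page; your XPath selectors are simply wrong. Judging by your output, the h2 you select picks up a heading ('Restaurants in Fredonia New York') rather than the individual restaurant entries, so the only href it finds is "#". The spider below selects each result block and reads the link from that block's own anchor.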
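As a side note, this is probably why the second callback in the question ran on the category page again and returned only None values. The sketch below uses only the standard library to show that an href of "#" resolves back to the page it was found on. The `resolve` helper and the `/example-restaurant` path are hypothetical, added here for illustration; Scrapy's own handling may differ in detail.

```python
from urllib.parse import urldefrag, urljoin

# The category page from the question.
page_url = "https://zaubee.com/category/restaurant-in-fredonia-hclq6jom"


def resolve(base: str, href: str) -> str:
    """Resolve href against base and drop the fragment, which is never sent to the server."""
    return urldefrag(urljoin(base, href)).url


# "#" carries no path of its own, so it points back at the current page.
same_page = resolve(page_url, "#")
print(same_page == page_url)

# A real relative link resolves to a different page
# ("/example-restaurant" is a hypothetical path, not taken from the site).
detail_page = resolve(page_url, "/example-restaurant")
print(detail_page)
```

So following "#" just fetches the listing again, and the detail-page selectors find nothing there. The spider that follows avoids this by extracting each restaurant's real link.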

import re
import unicodedata

import scrapy


class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']
    allowed_domains = ['zaubee.com']

    def parse(self, response):
        # Each search result sits in a div that carries a data-value attribute.
        restaurants = response.xpath('//div[@data-value]')
        for restaurant in restaurants:
            # Join the text nodes of the h3 to get the restaurant name.
            name = restaurant.xpath('.//h3/text()[not(span)]').getall()
            name = ''.join(name).strip()
            link = restaurant.xpath(".//a/@href").get(default='')
            yield {
                'name': name,
                'link': response.urljoin(link)
            }
            # Only follow results that actually have a link.
            if link:
                yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath('//h1/text()').get()
        website = response.xpath('//a[@rel]/@href').get(default='')
        # Add a scheme to protocol-relative URLs ("//example.com");
        # the ^ anchor leaves URLs that already have a scheme untouched.
        website = re.sub(r'^//', 'https://', website)
        address = response.xpath('//div[contains(@class, "address")]/span[last()]/text()').get(default='')
        # Normalize characters such as non-breaking spaces and drop newlines.
        address = unicodedata.normalize("NFKD", address).replace('\n', ' ').strip()

        yield {
            'name': name,
            'website': website,
            'address': address
        }
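
If you want to check the website and address clean-up without running a crawl, here is a minimal standard-library sketch of the same two steps. The helper names `absolute_url` and `clean_text` and the sample values are hypothetical, not taken from the site. It swaps the regex for `urllib.parse.urljoin`, which also handles protocol-relative URLs.

```python
import unicodedata
from urllib.parse import urljoin


def absolute_url(page_url: str, href: str) -> str:
    """Turn a relative or protocol-relative href into an absolute URL."""
    if not href:
        return ""  # nothing was extracted, so keep the field empty
    return urljoin(page_url, href)


def clean_text(raw: str) -> str:
    """Normalize compatibility characters (e.g. non-breaking spaces) and collapse whitespace."""
    normalized = unicodedata.normalize("NFKD", raw)
    return " ".join(normalized.split())


# Hypothetical sample values for illustration only.
page = "https://zaubee.com/biz/example"
print(absolute_url(page, "//example.com/menu"))
print(absolute_url(page, "https://example.com/"))
print(clean_text("  12 Main St\u00a0\nFredonia, NY "))
```

Inside a Scrapy callback, `response.urljoin(website)` should give the same result as `absolute_url(response.url, website)` for these cases.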

Upvotes: 0
