Reputation: 1
How can I scrape the zaubee.com website to extract business details from each restaurant's page when the href attribute is set to "#" in Scrapy?
I'm currently working on a web scraping project that gathers business information from the zaubee.com website. However, the href attribute I extract for each restaurant link is "#", which prevents me from visiting the individual restaurant pages and gathering the data I need.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    allowed_domains = ['www.zaubee.com']
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']

    def parse(self, response):
        restaurantlink = response.xpath("//div[@class='search-result__title-wrapper']/h2")
        for restaurant in restaurantlink:
            name = restaurant.xpath(".//text()").get()
            link = restaurant.xpath(".//@href").get()
            yield {
                'name': name,
                'link': link
            }
            yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath("//h1[@class='postcard__title postcard__title--claimed']/text()").get()
        website = response.xpath("(//a[@class='profile__website__link']/@href)[1]").get()
        address = response.xpath("(//address[@class='profile__address--compact']/text())[1]").get()
        yield {
            'name': name,
            'website': website,
            'address': address
        }
I've previously created a scraping solution using Scrapy, but I need help overcoming this challenge. What method or workaround can I use to visit each restaurant's page and get the necessary information?
OUTPUT FOR ONE ENTRY:
2023-06-04 23:38:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': 'Restaurants in Fredonia New York', 'link': '#'}
When the spider tries to follow that link, I get the output shown below:
2023-06-04 23:38:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://zaubee.com/category/restaurant-in-fredonia-hclq6jom>
{'name': None, 'website': None, 'address': None}
I'm trying to follow each restaurant link and collect the restaurant's name, address, telephone number, and opening hours from its page.
Upvotes: 0
Views: 102
Reputation: 4822
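The problem is simply that your XPath selectors are wrong. The one in parse matches the page heading (note the scraped name 'Restaurants in Fredonia New York') rather than the individual restaurant entries, so the only href it finds is "#". Select the result entries themselves and take the link from each one: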
import re
import unicodedata

import scrapy

class zaubeeSpider(scrapy.Spider):
    name = 'zaubeeerestaurant'
    start_urls = ['https://zaubee.com/category/restaurant-in-fredonia-hclq6jom']
    allowed_domains = ['zaubee.com']

    def parse(self, response):
        # Select the result containers (divs that carry a data-value attribute).
        restaurants = response.xpath('//div[@data-value]')
        for restaurant in restaurants:
            # Join the text nodes of the <h3> into a single name string.
            name = restaurant.xpath('.//h3/text()[not(span)]').getall()
            name = ''.join(name).strip()
            link = restaurant.xpath(".//a/@href").get(default='')
            yield {
                'name': name,
                'link': response.urljoin(link)
            }
            # Only follow entries that actually contain a link.
            if link:
                yield response.follow(url=link, callback=self.parse_restaurant)

    def parse_restaurant(self, response):
        name = response.xpath('//h1/text()').get()
        website = response.xpath('//a[@rel]/@href').get(default='')
        # Give protocol-relative links (//example.com) an explicit scheme.
        # The pattern is anchored so an already absolute URL is left untouched.
        website = re.sub(r'^//', 'https://', website)
        address = response.xpath('//div[contains(@class, "address")]/span[last()]/text()').get(default='')
        # Normalize Unicode whitespace variants and flatten line breaks.
        address = unicodedata.normalize("NFKD", address).replace('\n', ' ').strip()
        yield {
            'name': name,
            'website': website,
            'address': address
        }
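To see why following "#" only returned the category page again, it may help to look at how relative links are resolved against the page URL. The snippet below is a minimal sketch using only the standard library, not Scrapy itself; `resolve_href` and `is_same_page` are hypothetical helper names, and `example.com` is a placeholder host. It also shows `urljoin` as a possible alternative to the `re.sub` call for protocol-relative links such as `//example.com/menu`.

```python
from urllib.parse import urldefrag, urljoin

# Page URL taken from start_urls in the spider above.
PAGE_URL = "https://zaubee.com/category/restaurant-in-fredonia-hclq6jom"


def resolve_href(page_url: str, href: str) -> str:
    """Resolve an href against the page it was found on and drop any fragment."""
    absolute = urljoin(page_url, href)
    return urldefrag(absolute).url


def is_same_page(page_url: str, href: str) -> bool:
    """Return True when following href would only request the current page again."""
    return resolve_href(page_url, href) == urldefrag(page_url).url


if __name__ == "__main__":
    # The hrefs below are illustrative values, not data scraped from the site.
    for href in ["#", "//example.com/menu", "https://example.com/menu", "/some-restaurant"]:
        resolved = resolve_href(PAGE_URL, href)
        print(repr(href), "->", resolved, "| same page:", is_same_page(PAGE_URL, href))
```

A check like `is_same_page` could be used in `parse` to skip links that point back to the listing, which is consistent with the second log entry in the question being scraped from the category URL itself.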
Upvotes: 0