New learner

Reputation: 17

Xpath is correct but no result after scraping

I am trying to scrape the names of all the cities listed on the following page: https://www.zomato.com/directory.

I have tried the following XPath expressions.

python
# 1st approach:
def parse(self, response):
    cities_name = response.xpath('//div//h2//a/text()').extract_first()
    items['cities_name'] = cities_name
    yield items

# 2nd approach:
def parse(self, response):
    for city in response.xpath("//div[@class='col-l-5 col-s-8 item pt0 pb5 ml0']"):
        l = ItemLoader(item=CountryItem(), selector=city)
        l.add_xpath("cities_name", ".//h2//a/text()")
        yield l.load_item()
        yield city

Actual result: Crawled 0 pages, scraped 0 items
Expected: Adelaide, Ballarat, etc.

Upvotes: 0

Views: 205

Answers (3)

SIM

Reputation: 22440

The main reason you are not getting any results from that page is that its HTML is not well-formed. You can get the results by using the html5lib parser. I tried several parsers, and html5lib was the one that did the trick. Here is how you can do it. Note that I used a CSS selector rather than XPath.

import scrapy
from bs4 import BeautifulSoup

class ZomatoSpider(scrapy.Spider):
    name = "zomato"

    start_urls = ['https://www.zomato.com/directory']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html5lib')
        for item in soup.select(".row h2 > a"):
            yield {"name":item.text}

Upvotes: 0

gangabass

Reputation: 10666

Your XPath is wrong. Try this instead:

def parse(self, response):
    for city_node in response.xpath("//h2"):
        l = ItemLoader(item=CountryItem(), selector=city_node)
        l.add_xpath("city_name", ".//a/text()")
        yield l.load_item()

Upvotes: 0

Granitosaurus

Reputation: 21406

First thing to note:
Your XPath is a bit too specific. CSS classes in HTML don't always appear in a reliable order: "class1 class2" could end up as "class2 class1", or the attribute could contain stray whitespace such as a trailing space: "class1 class2 ".

When you match directly against [@class="class1 class2"], there's a high chance that it will fail. Instead, you should use the contains() function.
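
To illustrate why exact class matching is brittle, here is a minimal sketch that uses only Python's standard library. The markup, the class name pl5 and the city Brisbane are made up for the example, and has_class is a hypothetical helper, not part of any library. xml.etree.ElementTree has no contains(), so the tolerant match is emulated by splitting the attribute on whitespace. In Scrapy you would write contains(@class, "mb10") in the XPath itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same two classes written three different ways
# (normal order, reversed order, and with a trailing space).
SAMPLE = """
<div>
  <p class="mb10 pl5"><a><h2>Adelaide</h2></a></p>
  <p class="pl5 mb10"><a><h2>Ballarat</h2></a></p>
  <p class="mb10 pl5 "><a><h2>Brisbane</h2></a></p>
</div>
"""

root = ET.fromstring(SAMPLE)

# Exact string comparison, like [@class="mb10 pl5"] in XPath:
# only the element whose attribute is character-for-character equal matches.
exact = [p.find("a/h2").text for p in root.findall(".//p[@class='mb10 pl5']")]


def has_class(element, name):
    """Hypothetical helper: True if `name` is one of the whitespace-separated classes."""
    return name in element.get("class", "").split()


# Tolerant comparison: order and stray whitespace no longer matter.
tolerant = [p.find("a/h2").text for p in root.iter("p") if has_class(p, "mb10")]

print(exact)     # ['Adelaide']
print(tolerant)  # ['Adelaide', 'Ballarat', 'Brisbane']
```

Keep in mind that contains() is a plain substring test, so contains(@class, "mb10") would also match a class such as "mb100". The token split above is stricter in that respect.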

Second:
You have a small error in your cities_name XPath. In the HTML body the structure is a > h2 > text, whereas your code has it reversed: h2 > a > text.
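
The effect of the reversed nesting can be reproduced on a small fragment. This is only a sketch: the markup below is a simplified, hypothetical stand-in for the real page, and it uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure described above:
# the link wraps the heading (a > h2), not the other way round.
SAMPLE = """
<div>
  <p class="mb10"><a><h2>Adelaide</h2></a></p>
  <p class="mb10"><a><h2>Ballarat</h2></a></p>
</div>
"""

root = ET.fromstring(SAMPLE)

# Order used in the question: an <a> somewhere inside an <h2>.
# No such element exists in this fragment, so nothing is selected.
reversed_order = [a.text for a in root.findall(".//h2//a")]

# Corrected order: an <h2> directly inside an <a>.
correct_order = [h2.text for h2 in root.findall(".//a/h2")]

print(reversed_order)  # []
print(correct_order)   # ['Adelaide', 'Ballarat']
```

An XPath that selects nothing does not raise an error. It just returns an empty result, which is why the spider appears to run fine but scrapes no items.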

With that in mind, I managed to get it working with these CSS and XPath selectors:

$ parsel "https://www.zomato.com/directory"
> p.mb10>a>h2::text +first
Adelaide
> p.mb10>a>h2::text +len
736
> -xpath
switched to xpath
> //p[contains(@class,"mb10")]/a/h2/text() +first
Adelaide
> //p[contains(@class,"mb10")]/a/h2/text() +len
736

parselcli - https://github.com/Granitosaurus/parsel-cli

Upvotes: 1
