Reputation: 17
I am trying to scrape the names of all the cities listed on the following page: https://www.zomato.com/directory.
I have tried the following XPath expressions.
python
# 1st approach:
def parse(self, response):
    cities_name = response.xpath('//div//h2//a/text()').extract_first()
    items['cities_name'] = cities_name
    yield items

# 2nd approach:
def parse(self, response):
    for city in response.xpath("//div[@class='col-l-5 col-s-8 item pt0 pb5 ml0']"):
        l = ItemLoader(item=CountryItem(), selector=city)
        l.add_xpath("cities_name", ".//h2//a/text()")
        yield l.load_item()
        yield city
Actual result: Crawled 0 pages, scraped 0 items
Expected: Adelaide, Ballarat, etc.
Upvotes: 0
Views: 205
Reputation: 22440
The main reason you are not getting any results from that page is that the site's HTML is not well-formed. You can get the results using the html5lib parser. I tried several different parsers, but that was the only one that did the trick. The following is how you can do it. Note that I used a CSS selector rather than XPath.
import scrapy
from bs4 import BeautifulSoup

class ZomatoSpider(scrapy.Spider):
    name = "zomato"
    start_urls = ['https://www.zomato.com/directory']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html5lib')
        for item in soup.select(".row h2 > a"):
            yield {"name": item.text}
Upvotes: 0
Reputation: 10666
Your XPath is wrong. Try this instead:
def parse(self, response):
    for city_node in response.xpath("//h2"):
        l = ItemLoader(item=CountryItem(), selector=city_node)
        l.add_xpath("city_name", ".//a/text()")
        yield l.load_item()
Upvotes: 0
Reputation: 21406
First thing to note:
Your XPath is a bit too specific. CSS classes in HTML don't always come in a reliable order: "class1 class2" could end up being "class2 class1", or the attribute could even contain broken syntax such as trailing spaces: "class1 class2 ".
When you match your XPath directly against [@class="class1 class2"], there's a high chance that it will fail. Instead, you should try to use the contains function.
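To illustrate why an exact class match is fragile, here is a minimal sketch. It uses the standard library's ElementTree instead of Scrapy so that it runs on its own; the markup and the has_class helper are made up for the example and are not taken from the Zomato page. Note that XPath's contains() is a plain substring test, while the helper below checks whole class names, which is slightly stricter.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same two classes written three different ways
# (normal order, swapped order, and with a trailing space).
HTML = """
<div>
  <p class="mb10 item"><a><h2>Adelaide</h2></a></p>
  <p class="item mb10"><a><h2>Ballarat</h2></a></p>
  <p class="mb10 item "><a><h2>Brisbane</h2></a></p>
</div>
"""

root = ET.fromstring(HTML)

# Exact attribute match: only an attribute that is character-for-character
# equal to 'mb10 item' is selected.
exact = [p.find("a/h2").text for p in root.findall(".//p[@class='mb10 item']")]


def has_class(element, name):
    """Hypothetical helper: True if `name` is one of the element's classes."""
    return name in (element.get("class") or "").split()


# Class-name match: order and stray whitespace no longer matter.
loose = [p.find("a/h2").text for p in root.iter("p") if has_class(p, "mb10")]

print(exact)  # ['Adelaide']
print(loose)  # ['Adelaide', 'Ballarat', 'Brisbane']
```

In a Scrapy callback the same idea is expressed as response.xpath('//p[contains(@class, "mb10")]').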
Second:
You have a tiny error in your cities_name XPath. In the HTML body the nesting is a > h2 > text, while in your code it's reversed: h2 > a > text.
That being said, I managed to get it working with these CSS and XPath selectors:
$ parsel "https://www.zomato.com/directory"
> p.mb10>a>h2::text +first
Adelaide
> p.mb10>a>h2::text +len
736
> -xpath
switched to xpath
> //p[contains(@class,"mb10")]/a/h2/text() +first
Adelaide
> //p[contains(@class,"mb10")]/a/h2/text() +len
736
parselcli - https://github.com/Granitosaurus/parsel-cli
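The nesting-order point can be checked offline with a small sketch. The fragment below is a made-up stand-in shaped like the p > a > h2 structure described above (the href values are invented), and it uses the standard library's ElementTree rather than Scrapy, so treat it as an illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: each city is an <h2> nested inside an <a>,
# which is in turn nested inside a <p class="mb10">.
FRAGMENT = """
<div>
  <p class="mb10"><a href="/adelaide"><h2>Adelaide</h2></a></p>
  <p class="mb10"><a href="/ballarat"><h2>Ballarat</h2></a></p>
</div>
"""

root = ET.fromstring(FRAGMENT)

# Reversed order (an <a> inside an <h2>): nothing in the fragment matches.
reversed_order = [el.text for el in root.findall(".//h2/a")]

# Actual order (an <h2> inside an <a>): both city names are found.
actual_order = [el.text for el in root.findall(".//a/h2")]

print(reversed_order)  # []
print(actual_order)    # ['Adelaide', 'Ballarat']
```

An empty result from the reversed path is consistent with a spider that runs without errors but scrapes 0 items.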
Upvotes: 1