Reputation: 10383
I want to crawl this page for all the exhibitors:
https://greenbuildexpo.com/Attendee/Expohall/Exhibitors
But Scrapy doesn't load the content, so what I'm doing now is using Selenium to load the page and then searching for the links with Scrapy:
from selenium import webdriver
from scrapy.http import TextResponse

url = 'https://greenbuildexpo.com/Attendee/Expohall/Exhibitors'
driver_1 = webdriver.Firefox()
driver_1.get(url)

# wrap the rendered page source in a Scrapy response so XPath selectors work on it
content = driver_1.page_source
response = TextResponse(url=url, body=content, encoding='utf-8')
print len(set(response.xpath('//*[contains(@href, "Attendee/")]//@href').extract()))
The site doesn't seem to make any new request when the "next" button is pressed, so I was hoping to get all the links at once, but I'm only getting 43 links with that code. There should be around 500.
Now I'm trying to crawl the page by clicking the "next" button:
for i in range(10):
    xpath = '//*[@id="pagingNormalView"]/ul/li[15]'
    driver_1.find_element_by_xpath(xpath).click()
but I got an error:
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 192, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"//*[@id=\"pagingNormalView\"]/ul/li[15]"}
Stacktrace:
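For reference, an explicit wait rules out timing issues when locating the button. This is only a sketch reusing the XPath from above; the locator may still be wrong if the pager renders different markup:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for i in range(10):
    # wait up to 10 seconds for the paging control to become clickable
    next_button = WebDriverWait(driver_1, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="pagingNormalView"]/ul/li[15]'))
    )
    next_button.click()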
Upvotes: 1
Views: 748
Reputation: 473863
You don't need selenium for that: there is an XHR request that returns all exhibitors, so just simulate it. Demo from the Scrapy shell:
$ scrapy shell https://greenbuildexpo.com/Attendee/Expohall/Exhibitors
In [1]: fetch("https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors")
2016-10-13 12:45:46 [scrapy] DEBUG: Crawled (200) <GET https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors> (referer: None)
In [2]: import json
In [3]: data = json.loads(response.body)
In [4]: len(data["Data"])
Out[4]: 541
# printing booth number for demonstration purposes
In [5]: for item in data["Data"]:
...: print(item["BoothNumber"])
...:
2309
2507
...
1243
2203
943
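If you want this in a full spider rather than the shell, a minimal sketch built on the same endpoint might look like the following; the spider name and the yielded field name are illustrative, since only BoothNumber is demonstrated above:

import json
import scrapy

class ExhibitorsSpider(scrapy.Spider):
    # hypothetical spider name; the start URL is the XHR endpoint fetched above
    name = "exhibitors"
    start_urls = ["https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors"]

    def parse(self, response):
        data = json.loads(response.body)
        # "Data" holds the exhibitor records (541 of them, per the shell session)
        for item in data["Data"]:
            yield {"BoothNumber": item["BoothNumber"]}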
Upvotes: 3