Reputation: 10383
I want to crawl this page for all the exhibitors:
https://greenbuildexpo.com/Attendee/Expohall/Exhibitors
But Scrapy doesn't load the content, so what I'm doing now is using Selenium to load the page and then searching for the links with Scrapy:
from selenium import webdriver
from scrapy.http import TextResponse

url = 'https://greenbuildexpo.com/Attendee/Expohall/Exhibitors'
driver_1 = webdriver.Firefox()
driver_1.get(url)

# wrap the rendered page source in a Scrapy response so XPath selectors work on it
content = driver_1.page_source
response = TextResponse(url=url, body=content, encoding='utf-8')
print len(set(response.xpath('//*[contains(@href, "Attendee/")]//@href').extract()))
The site doesn't seem to make any new request when the "next" button is pressed, so I was hoping to get all the links at once, but I'm only getting 43 links with that code. There should be around 500.
Now I'm trying to crawl the page by clicking the "next" button:
for i in range(10):
    xpath = '//*[@id="pagingNormalView"]/ul/li[15]'
    driver_1.find_element_by_xpath(xpath).click()
but I got an error:
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 192, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"//*[@id=\"pagingNormalView\"]/ul/li[15]"}
Stacktrace:
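For reference, an explicit wait rules out timing issues when locating the button. This is only a sketch reusing the XPath from above; the locator may still be wrong if the pager renders different markup:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for i in range(10):
    # wait up to 10 seconds for the paging control to become clickable
    next_button = WebDriverWait(driver_1, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@id="pagingNormalView"]/ul/li[15]'))
    )
    next_button.click()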
Upvotes: 1
Views: 748
Reputation: 473863
You don't need selenium for that: there is an XHR request that returns all exhibitors, so just simulate it. Demo from the Scrapy shell:
$ scrapy shell https://greenbuildexpo.com/Attendee/Expohall/Exhibitors
In [1]: fetch("https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors")
2016-10-13 12:45:46 [scrapy] DEBUG: Crawled (200) <GET https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors> (referer: None)
In [2]: import json
In [3]: data = json.loads(response.body)
In [4]: len(data["Data"])
Out[4]: 541
# printing booth number for demonstration purposes
In [5]: for item in data["Data"]:
...: print(item["BoothNumber"])
...:
2309
2507
...
1243
2203
943
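If you want this in a full spider rather than the shell, a minimal sketch built on the same endpoint might look like the following; the spider name and the yielded field name are illustrative, since only BoothNumber is demonstrated above:

import json
import scrapy

class ExhibitorsSpider(scrapy.Spider):
    # hypothetical spider name; the start URL is the XHR endpoint fetched above
    name = "exhibitors"
    start_urls = ["https://greenbuildexpo.com/Attendee/ExpoHall/GetAllExhibitors"]

    def parse(self, response):
        data = json.loads(response.body)
        # "Data" holds the exhibitor records (541 of them, per the shell session)
        for item in data["Data"]:
            yield {"BoothNumber": item["BoothNumber"]}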
Upvotes: 3