Reputation: 4576
I am trying to scrape this url with this xpath:
//*[@class="cnnResultItem"]
It works in the Chrome DevTools console, but the Scrapy spider returns [].
I have tested every node on the path down to the one I want, and everything works up to and including //*[@id="mixedresults"]. Everything after this node results in [].
I am having the exact same issue here with //*[@class="item-title"]. Everything before this node works, and everything from that node onward fails.
2014-10-23 03:08:55-0400 [article_spider] INFO: Spider opened
2014-10-23 03:08:55-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-10-23 03:08:55-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-10-23 03:08:55-0400 [article_spider] DEBUG: Crawled (200) <GET http://www.cdc.gov/media/archives.htm> (referer: None)
***base_elem = ScraperElem object
***base_elem.x_path = //*[@class="item-title"]
***Found base_objects = []
2014-10-23 03:08:55-0400 [article_spider] ERROR: No base objects found!
2014-10-23 03:08:55-0400 [article_spider] INFO: Closing spider (finished)
2014-10-23 03:08:55-0400 [article_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 210,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 12999,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 10, 23, 7, 8, 55, 225000),
'log_count/DEBUG': 7,
'log_count/ERROR': 1,
'log_count/INFO': 6,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 10, 23, 7, 8, 55, 53000)}
2014-10-23 03:08:55-0400 [article_spider] INFO: Spider closed (finished)
Any ideas why this is happening would be greatly appreciated.
Upvotes: 1
Views: 715
Reputation: 8851
The URL you posted is blank initially and is populated with data by JavaScript. Scrapy does not execute JavaScript, so it only sees the initial HTML. You will need to find out what the JavaScript is requesting and parse that response instead.
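One way to confirm this without a browser is to count the matching elements in the raw HTML, before any script has run. The sketch below uses only the standard library; `raw_html` is a made-up stand-in for the downloaded page, and `ClassFinder` and `count_class` are hypothetical helper names, not part of Scrapy.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # A valueless attribute is reported as None, so fall back to "".
        # Note: this matches the class as one token among several, which is
        # slightly looser than the exact match in @class="cnnResultItem".
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in the static HTML carry class_name."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    finder.close()
    return finder.count


# Hypothetical raw response: the container is present, but the result items
# are only added later by JavaScript, so they are absent from the markup.
raw_html = (
    '<html><body><div id="mixedresults"></div>'
    '<script src="search.js"></script></body></html>'
)

print(count_class(raw_html, "cnnResultItem"))  # prints 0
```

If the count is zero for the downloaded body but the element shows up in the browser, the element is being created client-side.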
At first glance, it looks like you'll want to query and scrape
http://searchapp.cnn.com/cnn-search/query.jsp?query=ebola&ignore=mixed|article|video&start=1&npp=10|10|20&s=all&type=all&sortBy=relevance&primaryType=mixed&csiID=csi1
The results seem to be in JSON, which will actually be easier to parse. The CDC page you posted is populated by JavaScript as well. Disabling JavaScript in Chrome DevTools will make debugging easier, because you will then see what Scrapy sees.
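A minimal sketch of that approach follows: rebuild the query URL above from its parameters, then decode the JSON body. The parameter values are copied from the URL, but their meaning is a guess, and the structure of the JSON is not known here, so the sample body and `top_level_keys` helper are purely illustrative. Inspect the real response to find the fields you need.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://searchapp.cnn.com/cnn-search/query.jsp"


def build_search_url(query, start=1):
    """Rebuild the search URL; only query and start are varied here."""
    params = {
        "query": query,
        "ignore": "mixed|article|video",
        "start": start,
        "npp": "10|10|20",
        "s": "all",
        "type": "all",
        "sortBy": "relevance",
        "primaryType": "mixed",
        "csiID": "csi1",
    }
    # Keep "|" unescaped so the URL matches the one seen in the browser.
    return BASE_URL + "?" + urlencode(params, safe="|")


def top_level_keys(body):
    """Decode a JSON object and list its top-level keys, to explore its shape."""
    data = json.loads(body)
    return sorted(data.keys())


print(build_search_url("ebola"))

# Placeholder body, NOT the real response format of the endpoint.
sample_body = '{"example_key": [{"title": "placeholder"}]}'
print(top_level_keys(sample_body))
```

In a spider, the same `json.loads` call could be applied to the response body inside the callback for a request to this URL, with no XPath involved.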
Upvotes: 1