Arctelix

Reputation: 4576

xpath works in chrome dev tools, but not in scrapy

I am trying to scrape this url with this xpath:

//*[@class="cnnResultItem"]

It works in the Chrome DevTools console, but the Scrapy spider returns [].

I have tested every node on the way down to the one I want, and everything works up to and including //*[@id="mixedresults"]. Everything after this node results in [].

I am having the exact same issue here with //*[@class="item-title"]. Everything before this node works; that node and everything after it fail.

2014-10-23 03:08:55-0400 [article_spider] INFO: Spider opened
2014-10-23 03:08:55-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-10-23 03:08:55-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-10-23 03:08:55-0400 [article_spider] DEBUG: Crawled (200) <GET http://www.cdc.gov/media/archives.htm> (referer: None)
***base_elem =  ScraperElem object
***base_elem.x_path =  //*[@class="item-title"]
***Found base_objects =  []
2014-10-23 03:08:55-0400 [article_spider] ERROR: No base objects found!
2014-10-23 03:08:55-0400 [article_spider] INFO: Closing spider (finished)
2014-10-23 03:08:55-0400 [article_spider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 210,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 12999,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 10, 23, 7, 8, 55, 225000),
         'log_count/DEBUG': 7,
         'log_count/ERROR': 1,
         'log_count/INFO': 6,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 10, 23, 7, 8, 55, 53000)}
2014-10-23 03:08:55-0400 [article_spider] INFO: Spider closed (finished)

Any ideas why this is happening would be greatly appreciated.

Upvotes: 1

Views: 715

Answers (1)

Parker

Reputation: 8851

The URL you posted is initially blank and is populated with data by JavaScript. Scrapy does not execute JavaScript, so it only sees the page as the server delivers it. You will need to find out what the JavaScript is requesting and parse that response instead.
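To illustrate why the same XPath matches in the browser but not in the spider, here is a minimal sketch using only the Python standard library. Both HTML strings are made-up stand-ins (not the real CNN markup), and `count_class` is a hypothetical helper. It counts elements whose class list contains the name, which is slightly looser than the exact match `@class="cnnResultItem"`.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Hypothetical helper: number of elements carrying class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Assumed example of what the server sends before any JavaScript runs.
RAW_HTML = """
<html><body>
  <div id="mixedresults"></div>
  <script>/* fills #mixedresults with .cnnResultItem nodes at runtime */</script>
</body></html>
"""

# Assumed example of the DOM after the script has run in a browser.
RENDERED_HTML = """
<html><body>
  <div id="mixedresults">
    <div class="cnnResultItem">first result</div>
    <div class="cnnResultItem">second result</div>
  </div>
</body></html>
"""

print(count_class(RAW_HTML, "cnnResultItem"))       # 0: what a spider would see
print(count_class(RENDERED_HTML, "cnnResultItem"))  # 2: what dev tools show
```

The container with id "mixedresults" exists in both versions, which matches the symptom in the question: the XPath works down to that node and returns [] for anything inside it.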

At first glance, it looks like you'll want to query and scrape http://searchapp.cnn.com/cnn-search/query.jsp?query=ebola&ignore=mixed|article|video&start=1&npp=10|10|20&s=all&type=all&sortBy=relevance&primaryType=mixed&csiID=csi1

The results seem to be in JSON, which will actually be easier to parse. The CDC site you posted is populated with JavaScript as well. You can disable JavaScript in Chrome DevTools; that will make debugging easier, since you will see what Scrapy sees.
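A rough sketch of how that could look, assuming the endpoint accepts the same query parameters as the URL above and returns a plain JSON body (I have not verified either). `build_search_url` and `top_level_keys` are hypothetical helper names, and the sample body is invented just to show the parsing step, since I don't know the real field names.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://searchapp.cnn.com/cnn-search/query.jsp"


def build_search_url(query, start=1):
    """Rebuild the search URL with a different query term or start offset.

    Parameter names and values are copied from the URL in the answer;
    their exact meaning is an assumption.
    """
    params = {
        "query": query,
        "ignore": "mixed|article|video",
        "start": start,
        "npp": "10|10|20",
        "s": "all",
        "type": "all",
        "sortBy": "relevance",
        "primaryType": "mixed",
        "csiID": "csi1",
    }
    # urlencode percent-encodes "|" as %7C, which is an equivalent URL.
    return BASE_URL + "?" + urlencode(params)


def top_level_keys(body):
    """Parse a JSON object and list its top-level keys, to explore the layout."""
    data = json.loads(body)
    return sorted(data.keys())


# Invented sample body; the real response will have different fields.
SAMPLE_BODY = '{"results": [], "total": 0}'

print(build_search_url("ebola", start=11))
print(top_level_keys(SAMPLE_BODY))
```

In a spider callback you would pass the response body to `json.loads` instead of running XPath on it, then pick the fields you need once you have looked at the actual structure.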

Upvotes: 1
