Reputation: 1169
I am trying to get several hrefs that are in HTML blocks like this one (sorry for the formatting, but I guess you need everything):
<li class="evt-click" data-tab="yo" data-public="yoyo" data-tracking="1" data-tracking-tag="yo_name" data-tracking-params="{'type': 'yo'}" href="/the/url/i/want">
<a href="javascript:void(0)">Yo</a>
</li>
My Scrapy crawl is able to get the li elements I want (as elmts), but when I try elmts.xpath('@href'), no link is returned.
I don't get it, but I'm only two weeks into Scrapy!
Upvotes: 3
Views: 927
Reputation: 180391
If you want the hrefs from the li elements with the class evt-click, you can use the following XPath:
xpath('//li[@class="evt-click"]/@href')
In your own example you need:
xpath("./@href")
The reason neither works is that what you are looking for does not exist in the HTML at the link you provided: there are 11 li elements with class="evt-click", and none of them contains an href apart from the javascript:void(0) inside the a tag:
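As a sanity check, the relative XPath itself is fine: it does pull the attribute out of the snippet posted in the question. That snippet happens to be well-formed XML, so even the standard library can demonstrate the lookup (ElementTree is used here only as a stand-in for Scrapy's selector):

```python
import xml.etree.ElementTree as ET

# The li block from the question (well-formed, so ElementTree can parse it).
snippet = (
    '<li class="evt-click" data-tab="yo" data-public="yoyo" '
    'data-tracking="1" data-tracking-tag="yo_name" '
    "data-tracking-params=\"{'type': 'yo'}\" href=\"/the/url/i/want\">"
    '<a href="javascript:void(0)">Yo</a>'
    '</li>'
)

li = ET.fromstring(snippet)
print(li.get('href'))            # the attribute on the li itself
print(li.find('a').get('href'))  # the js placeholder inside the a tag
```

In the spider the equivalent call would be elmts.xpath('./@href').extract(); the point is that the attribute simply is not present in the HTML the live page serves before JavaScript runs.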
You can use scrapy-splash to let the page render fully and get the dynamically generated data; you need to install it as per the link's instructions:
Add to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
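Depending on the scrapy-splash/scrapyjs version, the settings also need to tell Scrapy where the Splash instance is running (the local address below is an assumption matching the docker command in this answer; check the project README for your version):

```python
# settings.py (assumed local Splash instance; adjust host/port to your setup)
SPLASH_URL = 'http://127.0.0.1:8050'
```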
Start a docker instance:
docker run -p 8050:8050 scrapinghub/splash
Then this is enough to get the data you want:
import scrapy


class MySpider(scrapy.Spider):
    name = "deez"
    start_urls = ["http://www.deezer.com/profile/154723101"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 1}
                }
            })

    def parse(self, response):
        print(response.xpath("//li[@class='evt-click']/@href").extract())
Output:
$ scrapy crawl deez
.............................
2016-03-20 23:01:12 [scrapy] DEBUG: Crawled (200) <POST http://127.0.0.1:8050/render.html> (referer: None)
[u'/profile/154723101/loved', u'/profile/154723101/playlists', u'/profile/154723101/albums', u'/profile/154723101/artists', u'/profile/154723101/radios', u'/profile/154723101/following', u'/profile/154723101/followers']
Selenium is also an option.
Upvotes: 3