baptiste

Reputation: 1169

Why can't Scrapy find the href here?

I am trying to get several hrefs that sit in HTML blocks like this one (sorry for the formatting, but I guess you need everything):

<li class="evt-click" data-tab="yo" data-public="yoyo" data-tracking="1" data-tracking-tag="yo_name" data-tracking-params="{'type': 'yo'}" href="/the/url/i/want">
  <a href="javascript:void(0)">Yo</a>
</li>

My Scrapy spider is able to get the li elements I want as elmts, but when I then try elmts.xpath('@href'), no link is returned.

I don't get it, but I have only been using Scrapy for 2 weeks!

Upvotes: 3

Views: 927

Answers (1)

Padraic Cunningham

Reputation: 180391

If you want the hrefs from the li elements with the class evt-click, you can use the following xpath:

xpath('//li[@class="evt-click"]/@href')

In your own example you need:

 xpath("./@href")

Neither works, though, because what you are looking for does not exist in the HTML served at the link you provided. There are 11 li elements with class="evt-click", and none of them has an href; the only href is the JavaScript one inside the a tag.
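One way to confirm this kind of diagnosis is to check the markup directly for an href on the li elements. Below is a minimal sketch that uses only the standard library's html.parser as a stand-in for Scrapy's selectors. The names EvtClickHrefs and li_hrefs and both HTML strings are hypothetical, made up for illustration: one string mimics the page as served, the other mimics the page after JavaScript has run.

```python
from html.parser import HTMLParser


class EvtClickHrefs(HTMLParser):
    """Collect the href attribute of every <li class="evt-click"> start tag."""

    def __init__(self):
        super().__init__()
        self.li_count = 0   # number of matching <li> elements seen
        self.hrefs = []     # href values found directly on those <li> elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and "evt-click" in classes:
            self.li_count += 1
            if attrs.get("href") is not None:
                self.hrefs.append(attrs["href"])


def li_hrefs(html):
    """Return (number of evt-click <li> elements, list of their hrefs)."""
    parser = EvtClickHrefs()
    parser.feed(html)
    return parser.li_count, parser.hrefs


# Hypothetical stand-ins, not the real page: markup as served (no href on
# the li) versus markup after JavaScript has added the href.
raw_html = (
    '<ul><li class="evt-click" data-tab="yo">'
    '<a href="javascript:void(0)">Yo</a></li></ul>'
)
rendered_html = (
    '<ul><li class="evt-click" data-tab="yo" href="/the/url/i/want">'
    '<a href="javascript:void(0)">Yo</a></li></ul>'
)

print(li_hrefs(raw_html))       # (1, [])
print(li_hrefs(rendered_html))  # (1, ['/the/url/i/want'])
```

If the li elements are found but the href list is empty for the body Scrapy downloaded, the attribute is most likely added by JavaScript after the page loads, so no XPath against the raw response can return it.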


You can use scrapy-splash to let the page render fully, which gives you the dynamically generated data. You need to install it as per the linked instructions.

Add to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

Start a docker instance:

docker run -p 8050:8050 scrapinghub/splash

Then this is enough to get the data you want:

import scrapy

class MySpider(scrapy.Spider):
    name = "deez"
    start_urls = ["http://www.deezer.com/profile/154723101"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 1}
                }
            })

    def parse(self, response):
        print(response.xpath("//li[@class='evt-click']/@href").extract())

Output:

$ scrapy crawl deez
.............................
2016-03-20 23:01:12 [scrapy] DEBUG: Crawled (200) <POST http://127.0.0.1:8050/render.html> (referer: None)
[u'/profile/154723101/loved', u'/profile/154723101/playlists', u'/profile/154723101/albums', u'/profile/154723101/artists', u'/profile/154723101/radios', u'/profile/154723101/following', u'/profile/154723101/followers']
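The hrefs in this output are relative paths, so they need to be joined with the site's base URL before they can be requested. A minimal sketch with the standard library follows, assuming the page URL from the spider above as the base; the variable names are illustrative only.

```python
from urllib.parse import urljoin

# Assumption: the original page URL is the base, not the Splash endpoint.
base_url = "http://www.deezer.com/profile/154723101"

# Two of the relative hrefs from the output above.
hrefs = ["/profile/154723101/loved", "/profile/154723101/playlists"]

absolute = [urljoin(base_url, href) for href in hrefs]
print(absolute)
# ['http://www.deezer.com/profile/154723101/loved',
#  'http://www.deezer.com/profile/154723101/playlists']
```

Inside a spider, response.urljoin(href) does the same job. The crawl log above shows the request going to the Splash endpoint on 127.0.0.1:8050, though, so it may be safer to join against the original page URL unless you have checked what response.url contains in your setup.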

Selenium is another option.

Upvotes: 3
