Reputation: 827
I've started learning Python, and I'm loving it so far. I keep looking at different libraries and whatnot, and so I stumbled upon Scrapy and thought I would give it a try. I wanted to get all the links from the daylerees colour schemes (from GitHub) and dump them somewhere for quick access.
So I did this:
import scrapy


class ThemeItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()


class ThemeSpider(scrapy.Spider):
    name = 'themespider'
    start_urls = ['https://github.com/daylerees/colour-schemes/tree/master/jetbrains']

    def parse(self, response):
        for sel in response.xpath('//a[@class="js-directory-link"]'):
            url = ThemeItem()
            url['name'] = sel.xpath('text()')
            url['link'] = sel.xpath('@href')
            yield url
And it isn't outputting anything at all. Any guidance would be much appreciated.
I'm running it like this:
scrapy runspider spider.py
Upvotes: 1
Views: 105
Reputation: 473873
The elements containing the js-directory-link class also have other classes, for example:
<a href="/daylerees/colour-schemes/tree/master/jetbrains/contrast" class="js-directory-link js-navigation-open" id="c8fd07f040a8f2dc85f5b2d3804ea3db-6b332f6820ec47d7ade641dbf72108b025b10440" title="contrast">contrast</a>
Your XPath requires the class attribute to equal "js-directory-link" exactly, so it matches nothing. You need a partial class-attribute match via contains():
//a[contains(@class, "js-directory-link")]
Or, you may use CSS selectors:
for sel in response.css('a.js-directory-link'):
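To see why the exact-match XPath returns nothing while a partial match works, here is a quick standard-library sketch (no Scrapy needed) that applies the same "class list contains js-directory-link" test to an abbreviated copy of the anchor from the page; the snippet of HTML is an assumption trimmed from the element shown above:

```python
from html.parser import HTMLParser

# Abbreviated stand-in for the GitHub markup quoted above (assumption).
HTML = '''
<a href="/daylerees/colour-schemes/tree/master/jetbrains/contrast"
   class="js-directory-link js-navigation-open" title="contrast">contrast</a>
'''

class DirectoryLinkParser(HTMLParser):
    """Collects (text, href) for <a> tags whose class list contains
    js-directory-link -- the same partial match that contains() performs."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An exact comparison, attrs.get('class') == 'js-directory-link',
        # would fail here: the attribute holds several classes.
        if tag == 'a' and 'js-directory-link' in attrs.get('class', '').split():
            self._href = attrs.get('href')

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((data, self._href))
            self._href = None

parser = DirectoryLinkParser()
parser.feed(HTML)
print(parser.links)
# [('contrast', '/daylerees/colour-schemes/tree/master/jetbrains/contrast')]
```

Scrapy's @class="js-directory-link" predicate is the exact comparison; contains(@class, ...) and the a.js-directory-link CSS selector both behave like the split-and-membership test above.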
Though I would really think about using the GitHub API instead.
Upvotes: 1