Reputation: 933
I am trying to scrape lynda.com courses and storing their info in a csv file. This is my code
# -*- coding: utf-8 -*-
import scrapy
import itertools
class LyndadevSpider(scrapy.Spider):
name = 'lyndadev'
allowed_domains = ['lynda.com']
start_urls = ['https://www.lynda.com/Developer-training-tutorials']
def parse(self, response):
#print(response.url)
titles = response.xpath('//li[@role="presentation"]//h3/text()').extract()
descs = response.xpath('//li[@role="presentation"]//div[@class="meta-description hidden-xs dot-ellipsis dot-resize-update"]/text()').extract()
links = response.xpath('//li[@role="presentation"]/div/div/div[@class="col-xs-8 col-sm-9 card-meta-data"]/a/@href').extract()
for title, desc, link in itertools.izip(titles, descs, links):
#print link
categ = scrapy.Request(link, callback=self.parse2)
yield {'desc': link, 'category': categ}
def parse2(self, response):
#getting categories by storing the navigation info
item = response.xpath('//ol[@role="navigation"]').extract()
return item
What I am trying to do here is that I am grabbing the titles, description of the list of tutorials and then navigating to the url and grabbing the categories in parse2.
However, I get results like this:
category,desc
<GET https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html>,https://www.lynda.com/SVN-Subversion-tutorials/SVN-Java-Developers/552873-2.html
<GET https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html>,https://www.lynda.com/Java-tutorials/WebSocket-Programming-Java-EE/574694-2.html
<GET https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html>,https://www.lynda.com/GameMaker-tutorials/Building-Physics-Based-Platformer-GameMaker-Studio-Using-GML/598780-2.html
How do I access the information that I want?
Upvotes: 2
Views: 374
Reputation: 10210
You need to yield
a scrapy.Request
in the parse
method that parses the responses of start_urls
(instead of yielding a dict). Also, I would rather loop over course items and extract the information for each course item separately.
I'm not sure what you mean exactly by categories. I suppose those are the tags you can see on the course details page at the bottom under Skills covered in this course. But I might be wrong.
Try this code:
# -*- coding: utf-8 -*-
import scrapy
class LyndaSpider(scrapy.Spider):
name = "lynda"
allowed_domains = ["lynda.com"]
start_urls = ['https://www.lynda.com/Developer-training-tutorials']
def parse(self, response):
courses = response.css('ul#category-courses div.card-meta-data')
for course in courses:
item = {
'title': course.css('h3::text').extract_first(),
'desc': course.css('div.meta-description::text').extract_first(),
'link': course.css('a::attr(href)').extract_first(),
}
request = scrapy.Request(item['link'], callback=self.parse_course)
request.meta['item'] = item
yield request
def parse_course(self, response):
item = response.meta['item']
#item['categories'] = response.css('div.tags a em::text').extract()
item['category'] = response.css('ol.breadcrumb li:last-child a span::text').extract_first()
return item
Upvotes: 2