Reputation: 10551
This is my first question here and I'm learning how to code by myself so please bear with me.
I'm working on a final CS50 project which I'm trying to built a website that aggregates online Spanish course from edx.org and other open online couses websites maybe. I'm using scrapy framework to scrap the filter results of Spanish courses on edx.org... Here is my first scrapy spider which I'm trying to get in each courses link to then get it's name (after I get the code right, also get the description, course url and more stuff).
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader
class Course_item(Item):
name = Field()
#description = Field()
#img_url = Field()
class Course_spider(CrawlSpider):
name = 'CourseSpider'
allowed_domains = ['https://www.edx.org/']
start_urls = ['https://www.edx.org/course/?language=Spanish']
rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)
def parse_item(self, response):
item = ItemLoader(Course_item, response)
item.add_xpath('name', '//*[@id="course-intro-heading"]/text()')
yield item.load_item()
When I run the spider with "scrapy runspider edxSpider.py -o edx.csv -t csv" I get an empty csv file and I also think is not getting into the right spanish courses results.
Basically I want to get in each courses of this link edx Spanish courses and get the name, description, provider, page url and img url.
Any ideas for why might be the problem?
Upvotes: 3
Views: 246
Reputation: 10551
from scrapy.http import Request
from scrapy import Spider
import json
class edx_scraper(Spider):
name = "edxScraper"
start_urls = [
'https://www.edx.org/api/v1/catalog/search?selected_facets[]=content_type_exact%3Acourserun&selected_facets[]=language_exact%3ASpanish&page=1&page_size=9&partner=edx&hidden=0&content_type[]=courserun&content_type[]=program&featured_course_ids=course-v1%3AHarvardX+CS50B+Business%2Ccourse-v1%3AMicrosoft+DAT206x+1T2018%2Ccourse-v1%3ALinuxFoundationX+LFS171x+3T2017%2Ccourse-v1%3AHarvardX+HDS2825x+1T2018%2Ccourse-v1%3AMITx+6.00.1x+2T2017_2%2Ccourse-v1%3AWageningenX+NUTR101x+1T2018&featured_programs_uuids=452d5bbb-00a4-4cc9-99d7-d7dd43c2bece%2Cbef7201a-6f97-40ad-ad17-d5ea8be1eec8%2C9b729425-b524-4344-baaa-107abdee62c6%2Cfb8c5b14-f8d2-4ae1-a3ec-c7d4d6363e26%2Ca9cbdeb6-5fc0-44ef-97f7-9ed605a149db%2Cf977e7e8-6376-400f-aec6-84dcdb7e9c73'
]
def parse(self, response):
data = json.loads(response.text)
for course in data['objects']['results']:
url = 'https://www.edx.org/api/catalog/v2/courses/' + course['key']
yield response.follow(url, self.course_parse)
if 'next' in data['objects'] is not None:
yield response.follow(data['objects']['next'], self.parse)
def course_parse(self, response):
course = json.loads(response.text)
yield{
'name': course['title'],
'effort': course['effort'],
}
Upvotes: 0
Reputation: 18799
You can't get edx
content with a simple request, it uses javascript rendering for getting the course element dynamically, so CrawlSpider
won't work on this case, because you need to find specific elements inside the response body to generate a new Request that will get what you need.
The real request (to get the urls of the courses) is this one, but you need to generate it from the previous response body (although you could just visit it an also get the correct data).
So, to generate the real request, you need data that is inside a script
tag:
from scrapy import Spider
import re
import json
class Course_spider(Spider):
name = 'CourseSpider'
allowed_domains = ['edx.org']
start_urls = ['https://www.edx.org/course/?language=Spanish']
def parse(self, response):
script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
parseable_json_data = re.search(r'Drupal.settings, ({.+})', script_text).group(1)
json_data = json.loads(parseable_json_data)
...
Now you have what you need on json_data
and only need to create the string URL.
Upvotes: 2
Reputation: 142631
This page use JavaScript to get data from server and add to page.
It uses urls like
https://www.edx.org/api/catalog/v2/courses/course-v1:IDBx+IDB33x+3T2017
Last part is course's number which you can find in HTML
<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">
Code
from scrapy.http import Request
from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractor import LinkExtractor
from scrapy.loader import ItemLoader
import json
class Course_spider(CrawlSpider):
name = 'CourseSpider'
allowed_domains = ['www.edx.org']
start_urls = ['https://www.edx.org/course/?language=Spanish']
rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)
def parse_item(self, response):
print('parse_item url:', response.url)
course_id = response.xpath('//*[@id="course-info-page"]/@data-course-id').extract_first()
if course_id:
url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
yield Request(url, callback=self.parse_json)
def parse_json(self, response):
print('parse_json url:', response.url)
item = json.loads(response.body)
return item
from scrapy.crawler import CrawlerProcess
c = CrawlerProcess({
'USER_AGENT': 'Mozilla/5.0',
'FEED_FORMAT': 'csv', # csv, json, xml
'FEED_URI': 'output.csv', #
})
c.crawl(Course_spider)
c.start()
Upvotes: 1