Reputation: 10551

Scrapy spider outputs empty csv file

This is my first question here and I'm learning how to code by myself so please bear with me.

I'm working on my final CS50 project, a website that aggregates online Spanish courses from edx.org and possibly other open online course websites. I'm using the Scrapy framework to scrape the filtered results for Spanish courses on edx.org. Here is my first Scrapy spider, which tries to follow each course link and get the course's name (once the code works, I also want the description, course URL and more).

from scrapy.item import Field, Item
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader

class Course_item(Item):
    name = Field()
    #description = Field()
    #img_url = Field()


class Course_spider(CrawlSpider):
    name = 'CourseSpider'
    allowed_domains = ['https://www.edx.org/']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow='True'),)

    def parse_item(self, response):
        item = ItemLoader(Course_item, response)
        item.add_xpath('name', '//*[@id="course-intro-heading"]/text()')

        yield item.load_item()

When I run the spider with "scrapy runspider edxSpider.py -o edx.csv -t csv" I get an empty CSV file, and I also think it is not reaching the right Spanish course results.

Basically I want to visit each course listed under this link, edx Spanish courses, and get the name, description, provider, page URL and image URL.

Any ideas what the problem might be?

Upvotes: 3

Answers (3)

Jey Miranda

Reputation: 10551

from scrapy import Spider
import json


class edx_scraper(Spider):

    name = "edxScraper"
    start_urls = [
        'https://www.edx.org/api/v1/catalog/search?selected_facets[]=content_type_exact%3Acourserun&selected_facets[]=language_exact%3ASpanish&page=1&page_size=9&partner=edx&hidden=0&content_type[]=courserun&content_type[]=program&featured_course_ids=course-v1%3AHarvardX+CS50B+Business%2Ccourse-v1%3AMicrosoft+DAT206x+1T2018%2Ccourse-v1%3ALinuxFoundationX+LFS171x+3T2017%2Ccourse-v1%3AHarvardX+HDS2825x+1T2018%2Ccourse-v1%3AMITx+6.00.1x+2T2017_2%2Ccourse-v1%3AWageningenX+NUTR101x+1T2018&featured_programs_uuids=452d5bbb-00a4-4cc9-99d7-d7dd43c2bece%2Cbef7201a-6f97-40ad-ad17-d5ea8be1eec8%2C9b729425-b524-4344-baaa-107abdee62c6%2Cfb8c5b14-f8d2-4ae1-a3ec-c7d4d6363e26%2Ca9cbdeb6-5fc0-44ef-97f7-9ed605a149db%2Cf977e7e8-6376-400f-aec6-84dcdb7e9c73'
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # Request the detail endpoint for every course on this results page
        for course in data['objects']['results']:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course['key']
            yield response.follow(url, self.course_parse)

        # Follow the next results page, if there is one
        if data['objects'].get('next'):
            yield response.follow(data['objects']['next'], self.parse)

    def course_parse(self, response):
        course = json.loads(response.text)
        yield {
            'name': course['title'],
            'effort': course['effort'],
        }
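
To try the paging logic without hitting the network, here is a rough sketch that pulls the same two things out of one search page: the course detail URLs and the next page URL. It assumes the response has the shape the spider relies on (an `objects` dict holding `results` and `next`). The `plan_requests` helper and the sample payload are made up for illustration and are not real edX output.

```python
# Sketch only: sample_page is a hypothetical payload, not real edX output.
API_BASE = 'https://www.edx.org/api/catalog/v2/courses/'


def plan_requests(data):
    """Return (course detail URLs, next page URL or None) for one search page."""
    objects = data.get('objects') or {}
    # One detail URL per course key found in the results list
    course_urls = [API_BASE + course['key'] for course in objects.get('results', [])]
    # A missing 'next' key and an explicit null both mean "no more pages"
    next_url = objects.get('next')
    return course_urls, next_url


sample_page = {
    'objects': {
        'results': [{'key': 'course-v1:IDBx+IDB33x+3T2017'}],
        'next': None,
    }
}

course_urls, next_url = plan_requests(sample_page)
print(course_urls)
print(next_url)
```

Keeping this step as a plain function makes it easy to check against a saved JSON response before wiring it into the spider.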

Upvotes: 0

eLRuLL

Reputation: 18799

You can't get edx content with a simple request: the site uses JavaScript rendering to load the course elements dynamically. CrawlSpider won't work in this case, because you need to find specific elements inside the response body to generate a new Request that will get what you need.

The real request (to get the URLs of the courses) is this one, but you need to generate it from the previous response body (although you could just visit it directly and also get the correct data).

So, to generate the real request, you need data that is inside a script tag:

from scrapy import Spider
import re
import json

class Course_spider(Spider):
    name = 'CourseSpider'
    allowed_domains = ['edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    def parse(self, response):
        script_text = response.xpath('//script[contains(text(), "Drupal.settings")]').extract_first()
        parseable_json_data = re.search(r'Drupal\.settings, ({.+})', script_text).group(1)
        json_data = json.loads(parseable_json_data)
        ...

Now you have what you need in json_data and only need to build the URL string.
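
The extraction step can be tested on its own, outside Scrapy. The sketch below applies the same regex to a string and guards the two cases that would otherwise raise: no matching script tag (`extract_first()` returns None) and no `Drupal.settings` call in the text. The `extract_settings` helper, the sample script and its keys (`edx`, `language`, `page_size`) are hypothetical; the real page's settings object will look different, and the regex assumes the object sits on a single line.

```python
import json
import re

# Hypothetical script content, shaped like the pattern the regex expects.
SAMPLE_SCRIPT = (
    '<script>jQuery.extend(Drupal.settings, '
    '{"edx": {"language": "Spanish", "page_size": 9}});</script>'
)


def extract_settings(script_text):
    """Return the object passed after Drupal.settings as a dict, or None."""
    if not script_text:
        # extract_first() yields None when no script tag matched
        return None
    match = re.search(r'Drupal\.settings, ({.+})', script_text)
    if match is None:
        return None
    return json.loads(match.group(1))


settings = extract_settings(SAMPLE_SCRIPT)
print(settings['edx']['language'])
```

Inside the spider, the same function could be called with `script_text` from `parse`, and the request skipped when it returns None.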

Upvotes: 2

furas

Reputation: 142631

This page uses JavaScript to get data from the server and add it to the page.

It uses urls like

https://www.edx.org/api/catalog/v2/courses/course-v1:IDBx+IDB33x+3T2017

The last part is the course ID, which you can find in the HTML:

<main id="course-info-page" data-course-id="course-v1:IDBx+IDB33x+3T2017">

Code

from scrapy.http import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import json

class Course_spider(CrawlSpider):

    name = 'CourseSpider'
    allowed_domains = ['www.edx.org']
    start_urls = ['https://www.edx.org/course/?language=Spanish']

    rules = (Rule(LinkExtractor(allow=r'/course'), callback='parse_item', follow=True),)

    def parse_item(self, response):
        print('parse_item url:', response.url)

        course_id = response.xpath('//*[@id="course-info-page"]/@data-course-id').extract_first()

        if course_id:
            url = 'https://www.edx.org/api/catalog/v2/courses/' + course_id
            yield Request(url, callback=self.parse_json)

    def parse_json(self, response):
        print('parse_json url:', response.url)

        item = json.loads(response.body)

        return item

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',      # csv, json, xml
    'FEED_URI': 'output.csv',  # output file
})
c.crawl(Course_spider)
c.start()

Upvotes: 1
