abarbosa

Reputation: 41

Scrapy xpath aria-select=false

I am trying to get the transcription info from some Khan Academy videos using scrapy. For example: https://www.khanacademy.org/math/algebra-basics/basic-alg-foundations/alg-basics-negative-numbers/v/opposite-of-a-number

When I tried to select the Transcript button with the XPath response.xpath('//div[contains(@role, "tablist")]/a').extract(), I only got the tab that has aria-selected="true", which is the About section. I think I need Scrapy to change aria-selected from false to true on the Transcript button and then retrieve the transcript.

Could anyone please clarify how I would be able to accomplish this?

Much appreciated !

Upvotes: 1

Views: 189

Answers (1)

Granitosaurus

Reputation: 21436

If you look at the network tab of your browser's developer tools, you can see that an AJAX request is made to retrieve the transcript once the page loads:

In this case it's https://www.khanacademy.org/api/internal/videos/2Zk6u7Uk5ow/transcript?casing=camel&locale=en&lang=en. It seems to use the YouTube video ID to build this API URL, so you can recreate it easily:

import json

import scrapy


class MySpider(scrapy.Spider):
    # ...
    transcript_url_template = 'https://www.khanacademy.org/api/internal/videos/{}/transcript?locale=en&lang=en'

    def parse(self, response):
        # find the YouTube id in the og:video meta tag
        youtube_id = response.xpath("//meta[@property='og:video']/@content").re_first('v/(.+)')
        # create the transcript API url using the YouTube id
        url = self.transcript_url_template.format(youtube_id)
        # download the data and parse it in parse_transcript
        yield scrapy.Request(url, self.parse_transcript)

    def parse_transcript(self, response):
        # convert JSON data to a Python dictionary
        data = json.loads(response.body)
        # parse your data!
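
If you want to check the URL-building step without running a crawl, here is a minimal sketch in plain Python. The helper name build_transcript_url is hypothetical, and it assumes the og:video content looks like https://www.youtube.com/v/<id>, which is what the re_first('v/(.+)') call above relies on. It also stops at the first ?, & or / so that any trailing query string does not end up in the id.

```python
import re

# Same template as in the spider above.
TRANSCRIPT_URL_TEMPLATE = (
    "https://www.khanacademy.org/api/internal/videos/{}/transcript?locale=en&lang=en"
)


def build_transcript_url(og_video_url):
    """Return the transcript API url for an og:video url, or None if no id is found.

    Hypothetical helper: assumes the video id follows a 'v/' path segment.
    """
    match = re.search(r"v/([^?&/]+)", og_video_url)
    if match is None:
        return None
    return TRANSCRIPT_URL_TEMPLATE.format(match.group(1))


if __name__ == "__main__":
    print(build_transcript_url("https://www.youtube.com/v/2Zk6u7Uk5ow"))
```

Returning None when nothing matches lets the spider skip pages without a video instead of requesting a URL that contains the literal text "None".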

Upvotes: 1
