Reputation: 41
I am trying to get the transcript from some Khan Academy videos using Scrapy. For example: https://www.khanacademy.org/math/algebra-basics/basic-alg-foundations/alg-basics-negative-numbers/v/opposite-of-a-number
When I tried to select the Transcript button with the XPath response.xpath('//div[contains(@role, "tablist")]/a').extract()
I only got information about the tab that has aria-selected="true",
which is the About section. It seems I would need Scrapy to change aria-selected
from false to true on the Transcript button and then retrieve the necessary information.
Could anyone please clarify how I can accomplish this?
Much appreciated!
Upvotes: 1
Views: 189
Reputation: 21436
If you open your browser's network inspector, you can see that an AJAX request is made to retrieve the transcript once the page loads.
In this case it's https://www.khanacademy.org/api/internal/videos/2Zk6u7Uk5ow/transcript?casing=camel&locale=en&lang=en It appears to use the YouTube video ID to build this API URL, so you can recreate it quite easily:
import json

import scrapy


class MySpider(scrapy.Spider):
    #...
    transcript_url_template = 'https://www.khanacademy.org/api/internal/videos/{}/transcript?locale=en&lang=en'

    def parse(self, response):
        # find the youtube id in the og:video meta tag
        youtube_id = response.xpath("//meta[@property='og:video']/@content").re_first('v/(.+)')
        # create transcript API url using the youtube id
        url = self.transcript_url_template.format(youtube_id)
        # download the data and parse it
        yield scrapy.Request(url, self.parse_transcript)

    def parse_transcript(self, response):
        # convert json data to python dictionary
        data = json.loads(response.body)
        # parse your data!
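If you want to check the URL-building step outside of a running spider, here is a minimal standalone sketch. The helper names `extract_youtube_id` and `build_transcript_url` are hypothetical, and it assumes the og:video content looks like https://www.youtube.com/v/<id>, which may not hold for every page:

```python
import re
from typing import Optional

# Same endpoint as the AJAX request seen in the network inspector.
TRANSCRIPT_URL_TEMPLATE = (
    "https://www.khanacademy.org/api/internal/videos/{}/transcript"
    "?casing=camel&locale=en&lang=en"
)


def extract_youtube_id(og_video_url: str) -> Optional[str]:
    """Return the id that follows 'v/' in the URL, or None if absent.

    Assumption: the og:video content has the form .../v/<id>, possibly
    followed by a query string.
    """
    match = re.search(r"v/([^?&/]+)", og_video_url)
    return match.group(1) if match else None


def build_transcript_url(youtube_id: str) -> str:
    """Fill the YouTube id into the transcript API URL template."""
    return TRANSCRIPT_URL_TEMPLATE.format(youtube_id)


if __name__ == "__main__":
    video_id = extract_youtube_id("https://www.youtube.com/v/2Zk6u7Uk5ow")
    if video_id is not None:
        print(build_transcript_url(video_id))
```

Returning None when no id is found lets the spider skip such pages instead of requesting a malformed URL.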
Upvotes: 1