Reputation: 49
I'm trying to scrape the results from here using Scrapy. The problem is that not all of the classes appear on the page until the 'load more results' tab is clicked.
The problem can be seen here:
My code looks like this:
class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )
    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print item['name']
        pass
Upvotes: 0
Views: 2264
Reputation: 21436
The second page of this website seems to be generated via an AJAX call. If you look at the network tab of any browser inspection tool, you'll see something like:
In this case it seems to be retrieving a JSON file from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134
The URL parameter _=1469471093134 does not appear to do anything, so you can trim it away: https://www.class-central.com/maestro/courses/recentlyAdded?page=2
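If you would rather build the page URLs in code than edit them by hand, here is a minimal sketch using only the standard library. The helper name page_url is made up for illustration, and it assumes the endpoint only needs the page parameter, as observed above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(url, page, drop=("_",)):
    """Return `url` with `page` set and the parameters named in `drop` removed.

    Hypothetical helper: it assumes the endpoint ignores the `_` parameter.
    """
    parts = urlsplit(url)
    # Keep every query parameter except the dropped ones and the old page value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop and key != "page"
    ]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# URL captured from the browser's network tab
captured = "https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134"
print(page_url(captured, 2))
print(page_url(captured, 3))
```

Inside a spider, w3lib's add_or_replace_parameter does the same job for the page parameter, so this is mostly useful for checking the URLs outside Scrapy.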
The returned JSON contains the HTML for the next page:
import json
from scrapy import Selector

# so you just need to load it up with
data = json.loads(response.body)
# and convert it to a scrapy selector
sel = Selector(text=data['table'])
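To sanity-check the payload shape without running a spider, here is a rough stdlib-only sketch. It uses html.parser instead of a Scrapy selector, and the sample payload is invented for illustration; only the 'table' key and the course-name-text class come from the real response. It also assumes the name spans contain no nested spans.

```python
import json
from html.parser import HTMLParser


class CourseNameParser(HTMLParser):
    """Collect the text of every <span class="course-name-text">."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False

    def handle_starttag(self, tag, attrs):
        # Exact class match, mirroring the XPath @class="course-name-text"
        if tag == "span" and dict(attrs).get("class") == "course-name-text":
            self._in_name = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_name = False

    def handle_data(self, data):
        if self._in_name:
            self.names[-1] += data


def course_names(body):
    """Decode the JSON body and parse the HTML stored under 'table'."""
    parser = CourseNameParser()
    parser.feed(json.loads(body)["table"])
    parser.close()
    return [name.strip() for name in parser.names]


# Hypothetical payload: the real response may carry more keys and markup.
sample = json.dumps({
    "table": '<tr><td><span class="course-name-text">Intro to Python</span></td></tr>'
             '<tr><td><span class="course-name-text">Machine Learning</span></td></tr>'
})
print(course_names(sample))
```

In the spider itself the Selector approach above is shorter and supports full XPath; this version only helps when inspecting a saved response by hand.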
To replicate this in your code try something like:
import json

from scrapy import Request, Selector
from w3lib.url import add_or_replace_parameter

# AJAX endpoint that serves every page after the first
AJAX_URL = 'https://www.class-central.com/maestro/courses/recentlyAdded'

def parse(self, response):
    # check if response is json, if so convert to selector
    if response.meta.get('is_json', False):
        # convert the json to scrapy.Selector here for parsing
        sel = Selector(text=json.loads(response.body)['table'])
    else:
        sel = Selector(response)
    # parse page here for items
    for name in sel.xpath('//span[@class="course-name-text"]/text()').extract():
        item = ClasscentralItem()  # new item for each course
        item['name'] = name
        yield item
    # do next page
    # (assumes the 'show more' div is also present in the json html;
    # if it is not, stop when a page yields no course names instead)
    next_page_el = sel.xpath("//div[@id='show-more-courses']")
    if next_page_el:  # there is next page
        next_page = response.meta.get('page', 1) + 1
        # make next page url
        url = add_or_replace_parameter(AJAX_URL, 'page', str(next_page))
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})
Upvotes: 1