Reputation: 428
I want to extract the list of all datasets available on Kaggle, see URL: kaggle.com/datasets
However, since the page uses infinite scroll, I cannot use conventional scraping methods in which the whole page is loaded at once. Any suggestion would be much appreciated.
Upvotes: 1
Views: 1429
Reputation: 889
You don't actually need to scrape; you can get a list of all the available datasets from the Kaggle API. Once you have it installed and configured, you can get a list of datasets like so:
kaggle datasets list
You can also use the API to download datasets. For example, this will download the CITES Wildlife Trade Database. (If you're just interested in a specific dataset, you can get the code to download it at the bottom of the data listing for that dataset.)
kaggle datasets download -d cites/cites-wildlife-trade-database
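If you'd rather stay in Python than shell out to the CLI, the kaggle package also exposes the API programmatically. Here's a minimal sketch, assuming the package is installed and your credentials are in ~/.kaggle/kaggle.json (dataset_list is paginated, so keep requesting pages until one comes back empty):
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

page = 1
while True:
    datasets = api.dataset_list(page=page)
    if not datasets:  # an empty page means we've listed everything
        break
    for ds in datasets:
        print(ds.ref)  # e.g. cites/cites-wildlife-trade-database
    page += 1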
Hope that helps! :)
Upvotes: 1
Reputation: 21446
If you inspect the page in your browser, you can see that every time you scroll down, an AJAX request appears in the Network tab.
The request is being made to:
https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=2
which returns the results in JSON format. You can keep incrementing page
until you reach the maximum number of results. The JSON body has a key 'totalDatasetListItems': 770
and each page returns 20 results, so you can use that info to build a loop.
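To sanity-check the endpoint before building a spider, you can fetch a single page directly. A minimal sketch with the requests library (the key names come from the response described above):
import requests

url = 'https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=1'
data = requests.get(url).json()
print(data['totalDatasetListItems'])  # total number of datasets, e.g. 770
print(len(data['datasetListItems']))  # 20 results per page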
This example is for Python 3 and shows how to get concurrent requests running with this sort of pagination system.
import json

import scrapy
from w3lib.url import add_or_replace_parameter


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=1']

    def parse(self, response):
        data = json.loads(response.body)
        total_results = data['totalDatasetListItems']
        # Figure out how many pages there are and loop through them,
        # stepping by 20 since we get 20 results per page.
        for offset in range(20, total_results, 20):
            page = offset // 20 + 1
            url = add_or_replace_parameter(response.url, 'page', str(page))
            yield scrapy.Request(url, self.parse_page)
        # Don't forget to parse the first page as well!
        yield from self.parse_page(response)

    def parse_page(self, response):
        data = json.loads(response.body)
        # Each entry in datasetListItems is one dataset; yield it as an item.
        for item in data['datasetListItems']:
            yield item
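Assuming the spider above is saved as myspider.py, you can run it and dump the scraped items to a file with Scrapy's runspider command:
scrapy runspider myspider.py -o datasets.json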
Upvotes: 4
Reputation: 92
The website fetches the list data via a GET request. Send a GET request to the URL:
https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=1
In the response callback you can parse the JSON data like this (the list key is datasetListItems, as in the answer above):
data = json.loads(response.body)
for entry in data['datasetListItems']:
    title = entry['title']
    forum_url = entry['forumUrl']
    ...
In this way you can get the 2nd page's data by incrementing the "page" parameter in the URL:
https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=2
https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=3
https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=4
...
until no more data is returned.
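A minimal sketch of that loop with the requests library (the title and forumUrl keys are the ones shown above):
import requests

page = 1
while True:
    url = f'https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page={page}'
    entries = requests.get(url).json()['datasetListItems']
    if not entries:  # stop once a page comes back empty
        break
    for entry in entries:
        print(entry['title'], entry['forumUrl'])
    page += 1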
Upvotes: 0