ARASH

Reputation: 428

How to scrape infinite scroll page of Kaggle dataset in Python?

I want to extract the list of all datasets available on Kaggle (see kaggle.com/datasets).

However, since the page uses infinite scroll, I cannot use conventional scraping methods that assume the whole page is loaded at once. Any suggestion is much appreciated.

Upvotes: 1

Views: 1429

Answers (3)

Rachael Tatman

Reputation: 889

You don't actually need to scrape; you can get a list of all the available datasets from the official Kaggle API. Once you have it installed and configured, you can list the datasets like so:

kaggle datasets list
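If you'd rather stay in Python than shell out to the CLI, the same package exposes a client class. A minimal sketch, assuming the kaggle package's KaggleApi class and its dataset_list(page=...) method (names may differ between versions):

from kaggle.api.kaggle_api_extended import KaggleApi  # assumed import path

api = KaggleApi()
api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json

# each call returns one page of dataset objects; printing one shows its ref
for ds in api.dataset_list(page=1):
    print(ds)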

You can also use the API to download datasets. For example, this will download the CITES Wildlife Trade Database. (If you're just interested in a specific dataset, you can get the code to download it at the bottom of the data listing for that dataset.)

kaggle datasets download -d cites/cites-wildlife-trade-database
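The equivalent download from Python, under the same assumption about the client class (dataset_download_files is the method name as I know it; check your installed version):

from kaggle.api.kaggle_api_extended import KaggleApi  # assumed import path

api = KaggleApi()
api.authenticate()
# fetch the dataset archive into ./data and unzip it
api.dataset_download_files('cites/cites-wildlife-trade-database',
                           path='data', unzip=True)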

Hope that helps! :)

Upvotes: 1

Granitosaurus

Reputation: 21446

If you open the browser's developer tools, you can see in the Network tab that every time you scroll down an AJAX request is made.

The request is being made to:

https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=2

This returns the results in JSON format. You can keep incrementing page until you reach the last results. The JSON response contains the key 'totalDatasetListItems' (770 at the time of writing) and each page returns 20 results, so you can use that information to build a loop.

This example is for Python 3 and shows how to get concurrent requests running with this sort of pagination system using Scrapy.

import json

import scrapy
from w3lib.url import add_or_replace_parameter


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=1']

    def parse(self, response):
        data = json.loads(response.body)
        total_results = data['totalDatasetListItems']
        # figure out how many pages there are and schedule a request for each;
        # step is 20 since the endpoint returns 20 results per page
        for offset in range(20, total_results, 20):
            page = offset // 20 + 1
            url = add_or_replace_parameter(response.url, 'page', str(page))
            yield scrapy.Request(url, self.parse_page)

        # don't forget to parse the first page as well!
        yield from self.parse_page(response)

    def parse_page(self, response):
        data = json.loads(response.body)
        # every page carries its results under 'datasetListItems'
        for item in data['datasetListItems']:
            yield item
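If you save this as myspider.py, you can run it with scrapy runspider myspider.py -o results.json and Scrapy will fetch all the pages concurrently.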

Upvotes: 4

bbanzzakji

Reputation: 92

The website fetches the list data via a GET request. Send a GET request to this URL:

https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=1

In the response callback you should parse the JSON data like this:

bundle_of_data = json.loads(response.body)
for entry in bundle_of_data['datasetListItems']:
    title = entry['title']
    forum_url = entry['forumUrl']
    ...

In this way you can get each subsequent page's data by increasing the "page" parameter in the URL:

https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=2
https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=3
https://www.kaggle.com/datasets.json?sortBy=hottest&group=all&page=4
...

until no more data is returned.
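A minimal sketch of that stop condition with the requests library, assuming the endpoint and the 'datasetListItems' key behave as described in the answers above:

import requests

url = 'https://www.kaggle.com/datasets.json'
params = {'sortBy': 'hottest', 'group': 'all', 'page': 1}
while True:
    items = requests.get(url, params=params).json().get('datasetListItems', [])
    if not items:
        # an empty page means we have walked past the last page
        break
    for entry in items:
        print(entry['title'], entry['forumUrl'])
    params['page'] += 1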

Upvotes: 0
