Johnson

Reputation: 181

Download Kaggle Dataset by using Python

I am trying to download a Kaggle dataset using Python. However, I ran into issues using the requests library: the downloaded .csv file turns out to be an HTML page rather than the data.

import requests

# The direct link to the Kaggle data set
data_url = 'https://www.kaggle.com/crawford/gene-expression/downloads/actual.csv'

# The local path where the data set is saved.
local_filename = "actual.csv"

# Kaggle Username and Password
kaggle_info = {'UserName': "myUsername", 'Password': "myPassword"}

# Attempts to download the CSV file. Gets rejected because we are not logged in.
r = requests.get(data_url)

# Login to Kaggle and retrieve the data.
r = requests.post(r.url, data = kaggle_info)

# Writes the data to a local file one chunk at a time.
f = open(local_filename, 'wb')
for chunk in r.iter_content(chunk_size = 512 * 1024): # Reads 512KB at a time into memory

    if chunk: # filter out keep-alive new chunks
        f.write(chunk)
f.close()

Output file

<!DOCTYPE html>
<html>
<head>
    <title>Gene expression dataset (Golub et al.) | Kaggle</title>
    <meta charset="utf-8" />
    <meta name="robots" content="index, follow"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">    <meta name="theme-color" content="#008ABC" />
    <link rel="dns-prefetch" href="https://www.google-analytics.com" /><link rel="dns-prefetch" href="https://stats.g.doubleclick.net" /><link rel="dns-prefetch" href="https://js.intercomcdn.com" /><link rel="preload" href="https://az416426.vo.msecnd.net/scripts/a/ai.0.js" as=script /><link rel="dns-prefetch" href="https://kaggle2.blob.core.windows.net" />
    <link href="/content/v/d420a040e581/kaggle/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <link rel="manifest" href="/static/json/manifest.json">
    <link href="//fonts.googleapis.com/css?family=Open+Sans:400,300,300italic,400italic,600,600italic,700,700italic" rel='stylesheet' type='text/css'>
                    <link rel="stylesheet" type="text/css" href="/static/assets/vendor.css?v=72f4ef2ebe4f"/>
        <link rel="stylesheet" type="text/css" href="/static/assets/app.css?v=d997fa977b65"/>
        <script>

            (function () {
                var originalError = window.onerror;

                window.onerror = function (message, url, lineNumber, columnNumber, error) {
                    var handled = originalError && originalError(message, url, lineNumber, columnNumber, error);
                    var blockedByCors = message && message.toLowerCase().indexOf("script error") >= 0;
                    return handled || blockedByCors;
                };
            })();
        </script>
    <script>
        var appInsights=window.appInsights||function(config){
        function i(config){t[config]=function(){var i=arguments;t.queue.push(function(){t[config].apply(t,i)})}}var t={config:config},u=document,e=window,o="script",s="AuthenticatedUserContext",h="start",c="stop",l="Track",a=l+"Event",v=l+"Page",y=u.createElement(o),r,f;y.src=config.url||"https://az416426.vo.msecnd.net/scripts/a/ai.0.js";u.getElementsByTagName(o)[0].parentNode.appendChild(y);try{t.cookie=u.cookie}catch(p){}for(t.queue=[],t.version="1.0",r=["Event","Exception","Metric","PageView","Trace","Dependency"];r.length;)i("track"+r.pop());return i("set"+s),i("clear"+s),i(h+a),i(c+a),i(h+v),i(c+v),i("flush"),config.disableExceptionTracking||(r="onerror",i("_"+r),f=e[r],e[r]=function(config,i,u,e,o){var s=f&&f(config,i,u,e,o);return s!==!0&&t["_"+r](config,i,u,e,o),s}),t
        }({
            instrumentationKey:"5b3d6014-f021-4304-8366-3cf961d5b90f",
            disableAjaxTracking: true
        });
        window.appInsights=appInsights;
        appInsights.trackPageView();
    </script>

Upvotes: 18

Views: 55315

Answers (10)

I developed a package to search and download Kaggle datasets from a Jupyter Notebook: https://pypi.org/project/kaggle-downloader-package/#description.

I leave you the repo as well in case you want to collaborate on it: https://github.com/Mgobeaalcoba/kaggle_downloader_package

Upvotes: 0

David Leal

Reputation: 6749

I tested some of the solutions provided here, but some are outdated as of today, 12/22/2023. Here is what I implemented and tested with Python 3.12.1 under JupyterLab:

import os

""" Downloads competition files from Kaggle, assuming you previously downloaded the 
kaggle.json file and put it in the location indicated here: 
https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials. 
It unzips the files and places them in the dataDir location. If the folder already 
contains data, it is deleted first. Once the files are unzipped, 
the zip file is removed."""
def downloadInputData(competitionName, dataDir='input'):
  import importlib.util
  if importlib.util.find_spec('kaggle') is None:
    ! pip install kaggle --quiet
  import kaggle
  kaggle.api.authenticate() # raise an error if the kaggle.json is not in the expected location

  # download and unzip competition data
  ! rm -rf {dataDir}  # removing data files if they exist
  ! kaggle competitions download -q {competitionName} # -q for quiet download
  ! mkdir -p {dataDir}
  zipFile = competitionName + '.zip'
  if not os.path.exists(zipFile):
    print(f"Error: {zipFile} not found.")
  else:
    # -q silent option (no output), concatenate rm to remove the zip file
    ! unzip -q {zipFile} -d {dataDir} && rm {zipFile}

"""Gets the kaggle.json file from Colab and puts it in the expected location, as specified by
https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials
It assumes the kaggle.json file is in Google Drive
at the Colab Notebooks location. This function can be invoked only from Colab,
because that is where the package google.colab exists."""
def getKeyFileFromColab():
  from google.colab import drive # declaring it here to avoid ModuleNotFoundError in Kaggle
  # We need to escape the space ('\\ ')
  gdrive_kaggleCreds_file = '/content/drive/My\\ Drive/Colab\\ Notebooks/kaggle.json'
  kaggleDir = '~/.kaggle' # Destination folder
  kaggle_file = kaggleDir + '/' + 'kaggle.json' # Destination file
  drive.mount("/content/drive", force_remount=False)
  ! mkdir -p {kaggleDir} # -p option doesn't raise an error if the folder exists
  ! cp {gdrive_kaggleCreds_file} {kaggleDir}
  ! chmod 600 {kaggle_file} # user read/write
  drive.flush_and_unmount()

# Testing
isLocal = True # Using a local notebook or Kaggle
isColab = True # Control if the local environment is Colab
loadKeyFile = False # Control to download the Kaggle key file
competitionName = 'titanic'
dataDir = 'input/' if isLocal==True else '/kaggle/input/' + competitionName + '/'
workDir = 'working/' if isLocal==True else '/kaggle/working/'

if isLocal: # Creating the working folder when working locally
  ! mkdir -p {workDir}
  # Getting kaggle.json file from Colab and putting it in the correct location
  if isColab and loadKeyFile: getKeyFileFromColab()
  # downloading competition files from Kaggle
  downloadInputData(competitionName=competitionName, dataDir=dataDir)

The function getKeyFileFromColab just copies the kaggle.json file stored in Google Drive (under Colab Notebooks) and puts it in the expected location. If you are not using Colab, you cannot invoke this function; instead, you need to download kaggle.json manually and put it in the expected folder: ~/.kaggle, i.e. $HOME/.kaggle. The file only needs to be put in place once per active session, which is why there is a separate function for it. You can control this step via the loadKeyFile variable.
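Outside Colab, the manual step can also be scripted. The following is only a minimal sketch, not part of the code I tested above: the function name install_kaggle_key and its parameters are made up for illustration, and it assumes you already downloaded kaggle.json somewhere on disk.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_key(src, home=None):
    """Copy a downloaded kaggle.json into <home>/.kaggle and restrict its permissions.

    Hypothetical helper: `src` is wherever you saved kaggle.json,
    `home` defaults to the current user's home directory.
    """
    home_dir = Path(home) if home is not None else Path.home()
    dest_dir = home_dir / ".kaggle"
    dest_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(src, dest)
    os.chmod(dest, 0o600)  # user read/write only, same as `chmod 600`
    return dest
```

Calling install_kaggle_key("/path/to/downloaded/kaggle.json") once per environment should be enough; after that, kaggle.api.authenticate() looks for the file in ~/.kaggle.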

Once we have the correct setup, then we can download the competition files via downloadInputData function.


Upvotes: 0

Climbs_lika_Spyder

Reputation: 6714

I have really struggled with the Kaggle API so I use opendatasets. It is important to have your kaggle.json in the same folder as your notebook.

pip install opendatasets

import opendatasets as od

od.download("https://www.kaggle.com/competitions/tlvmc-parkinsons-freezing-gait-prediction/data","/mypath/goes/here")

Documentation

Upvotes: 0

Sergey Sukhov

Reputation: 11

Full version of the Download_Kaggle_Dataset_To_Colab example, with explanations, that started working for me under Windows:

#Step1
#Input:
from google.colab import files
files.upload()  #this will prompt you to upload kaggle.json. Download it from Kaggle (Kaggle API file, kaggle.json), save it to a folder on your PC and choose it here

#Output Sample:
#kaggle.json
#kaggle.json(application/json) - 69 bytes, last modified: 29.06.2021 - 100% done
#Saving kaggle.json to kaggle.json
#{'kaggle.json': 
#b'{"username":"sergeysukhov7","key":"23d4d4abdf3bee8ba88e653cec******"}'}

#Step2
#Input:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json  # set permission

#Output:
#kaggle.json

#Step3
#Input:
#Set the environment variables
import os
os.environ['KAGGLE_USERNAME'] = "sergeysukhov7"  #manually input My_Kaggle User_Name 
os.environ['KAGGLE_KEY'] = "23d4d4abdf3bee8ba88e653cec5*****"  #manually input My_Kaggle Key 

#Step4
#!kaggle datasets download -d zillow/zecon #downloads the dataset to the default folder (content/zecon.zip), if that is what you want

#Find the Kaggle dataset link, for example https://www.kaggle.com/willkoehrsen/home-credit-default-risk-feature-tools, and take the owner/dataset part of it: willkoehrsen/home-credit-default-risk-feature-tools
#Dataset part of the Kaggle link: willkoehrsen/home-credit-default-risk-feature-tools
#Colab destination: /content/gdrive/My Drive/kaggle/credit/home-credit-default-risk-feature-tools.zip
!kaggle datasets download -d willkoehrsen/home-credit-default-risk-feature-tools -p /content/gdrive/My\ Drive/kaggle/credit 

#Output
#Downloading home-credit-default-risk-feature-tools.zip to /content/gdrive/My Drive/kaggle/credit
#100% 3.63G/3.63G [01:31<00:00, 27.6MB/s]
#100% 3.63G/3.63G [01:31<00:00, 42.7MB/s]
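
The owner/dataset part used in Step4 can also be cut out of the link in code. This is a rough sketch using only the standard library: dataset_slug is a hypothetical helper name, and the handling of the newer /datasets/ prefix in the URL is my assumption about how current Kaggle links look. It is meant for dataset pages only, not competition pages.

```python
from urllib.parse import urlparse


def dataset_slug(url):
    """Return 'owner/dataset-name' from a Kaggle dataset page URL (hypothetical helper)."""
    # Split the URL path into its non-empty segments.
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumption: newer links look like /datasets/owner/name, so drop that prefix.
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"cannot find owner/dataset in {url!r}")
    return "/".join(parts[:2])
```

The returned string is what goes after -d in the kaggle datasets download command.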

Upvotes: 1

prosti

Reputation: 46291

Before anything:

pip install kaggle

For the dataset:

import os
os.environ['KAGGLE_USERNAME'] = "uname" # username from the json file
os.environ['KAGGLE_KEY'] = "kaggle_key" # key from the json file
!kaggle datasets download -d zynicide/wine-reviews

For the competitions:

import os
os.environ['KAGGLE_USERNAME'] = "uname" # username from the json file
os.environ['KAGGLE_KEY'] = "kaggle_key" # key from the json file
!kaggle competitions download -c dogs-vs-cats-redux-kernels-edition
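
To avoid pasting the values by hand, they can be read straight from the downloaded json file. A small sketch, assuming the file has the usual "username" and "key" fields; load_kaggle_env is just an illustrative helper name, not something provided by the kaggle package:

```python
import json
import os


def load_kaggle_env(json_path):
    """Read kaggle.json and export its values as KAGGLE_USERNAME / KAGGLE_KEY."""
    with open(json_path, encoding="utf-8") as fh:
        creds = json.load(fh)
    # Same two variables as set by hand above.
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]
```

Call it with the path of your kaggle.json before running the !kaggle commands, so the key never has to appear in the notebook itself.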

Some time ago I provided another similar answer.

Upvotes: 2

user2458922

Reputation: 1721

Ref https://github.com/Kaggle/kaggle-api

Step 1: Try installing Kaggle

pip install kaggle # Windows
pip install --user kaggle # Mac/Linux

Step 2:

Update your credentials so that kaggle can authenticate using .kaggle/kaggle.json, which is based on the token generated from Kaggle. Ref: https://medium.com/@ankushchoubey/how-to-download-dataset-from-kaggle-7f700d7f9198

Step 3: Now, instead of kaggle competitions download ..

run ~/.local/bin/kaggle competitions download .. to avoid the "kaggle: command not found" error.
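
If you call the command from a Python script, the same fallback can be checked in code. This is only a sketch: find_kaggle_cli and its parameters are hypothetical names, and it assumes the pip install --user location on Mac/Linux is ~/.local/bin as in Step 3.

```python
import os
import shutil


def find_kaggle_cli(search_path=None, home=None):
    """Return the path of the kaggle executable, or None if it cannot be found."""
    # search_path=None makes shutil.which look through the normal PATH.
    found = shutil.which("kaggle", path=search_path)
    if found:
        return found
    # Fall back to the per-user install location used in Step 3.
    home = home or os.path.expanduser("~")
    fallback = os.path.join(home, ".local", "bin", "kaggle")
    if os.path.isfile(fallback) and os.access(fallback, os.X_OK):
        return fallback
    return None
```

If this returns None, kaggle is most likely not installed for the Python you are running.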

Upvotes: -3

Murat Uslu

Reputation: 101

The Kaggle API key and username are available on your Kaggle profile page, and the dataset download link is available on the dataset details page on Kaggle.

#Set the environment variables
import os
os.environ['KAGGLE_USERNAME'] = "xxxx"
os.environ['KAGGLE_KEY'] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
!kaggle competitions download -c dogs-vs-cats-redux-kernels-edition

Upvotes: 10

Yannis

Reputation: 711

Basically, if you want to use the Kaggle Python API (the solution provided by @minh-triet is for the command line, not for Python), you have to do the following:

import kaggle

kaggle.api.authenticate()

kaggle.api.dataset_download_files('The_name_of_the_dataset', path='the_path_you_want_to_download_the_files_to', unzip=True)

I hope this helps.

Upvotes: 28

Beau Hilton

Reputation: 433

Just to make things easy for the next person, I combined the fantastic answer from CaitLAN Jenner with a little bit of code that takes the raw csv info and puts it into a Pandas DataFrame, assuming that row 0 has the column names. I used it to download the Pima Diabetes dataset from Kaggle, and it worked swimmingly.

I'm sure there are more elegant ways to do this, but it worked well enough for a class I was teaching, is easily interpretable, and lets you get to analysis with minimal fuss.

import pandas as pd
import requests
import csv

payload = {
    '__RequestVerificationToken': '',
    'username': 'username',
    'password': 'password',
    'rememberme': 'false'
}

loginURL = 'https://www.kaggle.com/account/login'
dataURL = "https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/diabetes.csv"

with requests.Session() as c:
    response = c.get(loginURL).text
    AFToken = response[response.index('antiForgeryToken')+19:response.index('isAnonymous: ')-12]
    #print("AntiForgeryToken={}".format(AFToken))
    payload['__RequestVerificationToken']=AFToken
    c.post(loginURL + "?isModal=true&returnUrl=/", data=payload)
    download = c.get(dataURL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    #for row in my_list:
    #    print(row)


df = pd.DataFrame(my_list)
header = df.iloc[0]
df = df[1:]
diab = df.set_axis(header, axis='columns')  # returns a new DataFrame; the inplace argument was removed in pandas 2.0

# to make sure it worked, uncomment this next line:
# diab


Upvotes: -3

Minh Triet

Reputation: 1250

I would recommend checking out the Kaggle API instead of using your own code. As of the latest version, an example command to download a dataset is kaggle datasets download -d zillow/zecon
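
Since the question asks for Python, the same command can be launched from a script. A minimal sketch, assuming the kaggle command is installed and your credentials are already set up; both function names are made up for illustration, and only the -d and -p options are used:

```python
import subprocess


def dataset_download_cmd(dataset, dest=None, kaggle_exe="kaggle"):
    """Build the argument list for `kaggle datasets download -d <dataset>`."""
    cmd = [kaggle_exe, "datasets", "download", "-d", dataset]
    if dest is not None:
        cmd += ["-p", str(dest)]  # -p sets the folder to download into
    return cmd


def download_dataset(dataset, dest=None):
    """Run the CLI; raises CalledProcessError if kaggle exits with an error."""
    subprocess.run(dataset_download_cmd(dataset, dest), check=True)
```

For example, download_dataset("zillow/zecon") runs the command shown above and leaves the zip file in the current folder.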

Upvotes: 6
