Reputation: 181
I am trying to download a Kaggle dataset
using Python. However, I ran into problems using the requests
library: the downloaded .csv file turns out to be an HTML page instead of the CSV data.
import requests
# The direct link to the Kaggle data set
data_url = 'https://www.kaggle.com/crawford/gene-expression/downloads/actual.csv'
# The local path where the data set is saved.
local_filename = "actsual.csv"
# Kaggle Username and Password
kaggle_info = {'UserName': "myUsername", 'Password': "myPassword"}
# Attempts to download the CSV file. Gets rejected because we are not logged in.
r = requests.get(data_url)
# Login to Kaggle and retrieve the data.
r = requests.post(r.url, data = kaggle_info)
# Writes the data to a local file one chunk at a time.
f = open(local_filename, 'wb')
for chunk in r.iter_content(chunk_size = 512 * 1024): # Reads 512KB at a time into memory
    if chunk: # filter out keep-alive new chunks
        f.write(chunk)
f.close()
Output file
<!DOCTYPE html>
<html>
<head>
<title>Gene expression dataset (Golub et al.) | Kaggle</title>
<meta charset="utf-8" />
<meta name="robots" content="index, follow"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0"> <meta name="theme-color" content="#008ABC" />
<link rel="dns-prefetch" href="https://www.google-analytics.com" /><link rel="dns-prefetch" href="https://stats.g.doubleclick.net" /><link rel="dns-prefetch" href="https://js.intercomcdn.com" /><link rel="preload" href="https://az416426.vo.msecnd.net/scripts/a/ai.0.js" as=script /><link rel="dns-prefetch" href="https://kaggle2.blob.core.windows.net" />
<link href="/content/v/d420a040e581/kaggle/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<link rel="manifest" href="/static/json/manifest.json">
<link href="//fonts.googleapis.com/css?family=Open+Sans:400,300,300italic,400italic,600,600italic,700,700italic" rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" href="/static/assets/vendor.css?v=72f4ef2ebe4f"/>
<link rel="stylesheet" type="text/css" href="/static/assets/app.css?v=d997fa977b65"/>
<script>
(function () {
var originalError = window.onerror;
window.onerror = function (message, url, lineNumber, columnNumber, error) {
var handled = originalError && originalError(message, url, lineNumber, columnNumber, error);
var blockedByCors = message && message.toLowerCase().indexOf("script error") >= 0;
return handled || blockedByCors;
};
})();
</script>
<script>
var appInsights=window.appInsights||function(config){
function i(config){t[config]=function(){var i=arguments;t.queue.push(function(){t[config].apply(t,i)})}}var t={config:config},u=document,e=window,o="script",s="AuthenticatedUserContext",h="start",c="stop",l="Track",a=l+"Event",v=l+"Page",y=u.createElement(o),r,f;y.src=config.url||"https://az416426.vo.msecnd.net/scripts/a/ai.0.js";u.getElementsByTagName(o)[0].parentNode.appendChild(y);try{t.cookie=u.cookie}catch(p){}for(t.queue=[],t.version="1.0",r=["Event","Exception","Metric","PageView","Trace","Dependency"];r.length;)i("track"+r.pop());return i("set"+s),i("clear"+s),i(h+a),i(c+a),i(h+v),i(c+v),i("flush"),config.disableExceptionTracking||(r="onerror",i("_"+r),f=e[r],e[r]=function(config,i,u,e,o){var s=f&&f(config,i,u,e,o);return s!==!0&&t["_"+r](config,i,u,e,o),s}),t
}({
instrumentationKey:"5b3d6014-f021-4304-8366-3cf961d5b90f",
disableAjaxTracking: true
});
window.appInsights=appInsights;
appInsights.trackPageView();
</script>
Upvotes: 18
Views: 55315
Reputation: 1
I developed a package to search and download Kaggle datasets from a Jupyter Notebook: https://pypi.org/project/kaggle-downloader-package/#description.
Here is the repo as well, in case you want to contribute to it: https://github.com/Mgobeaalcoba/kaggle_downloader_package
Upvotes: 0
Reputation: 6749
I tested some of the solutions provided here, but some are outdated as of today (12/22/2023). Here is what I implemented and tested with Python 3.12.1 under JupyterLab:
import os
def downloadInputData(competitionName, dataDir='input'):
    """Downloads competition files from Kaggle, assuming you previously downloaded the
    kaggle.json file and put it in the location indicated here:
    https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials.
    It unzips the files and places them in the dataDir location. If the folder already
    contains data, it is deleted first. Once the data is unzipped,
    the zip file is removed."""
    import importlib.util
    if importlib.util.find_spec('kaggle') is None:
        ! pip install kaggle --quiet
    import kaggle
    kaggle.api.authenticate() # raises an error if kaggle.json is not in the expected location
    # download and unzip competition data
    ! rm -rf {dataDir} # removing data files if they exist
    ! kaggle competitions download -q {competitionName} # -q for quiet download
    ! mkdir -p {dataDir}
    zipFile = competitionName + '.zip'
    if not os.path.exists(zipFile):
        print(f"Error: {zipFile} not found.")
    else:
        # -q silent option (no output), chained with rm to remove the zip file
        ! unzip -q {zipFile} -d {dataDir} && rm {zipFile}
def getKeyFileFromColab():
    """Gets the kaggle.json file from Colab and puts it in the expected location, as specified by
    https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials
    It assumes the kaggle.json file is in Google Drive
    at the Colab Notebooks location. This function can be invoked only from Colab,
    because that is where the package google.colab exists."""
    from google.colab import drive # declaring it here to avoid ModuleNotFoundError in Kaggle
    # We need to escape the space ('\\ ')
    gdrive_kaggleCreds_file = '/content/drive/My\\ Drive/Colab\\ Notebooks/kaggle.json'
    kaggleDir = '~/.kaggle' # Destination folder
    kaggle_file = kaggleDir + '/' + 'kaggle.json' # Destination file
    drive.mount("/content/drive", force_remount=False)
    ! mkdir -p {kaggleDir} # -p option doesn't raise an error if the folder exists
    ! cp {gdrive_kaggleCreds_file} {kaggleDir}
    ! chmod 600 {kaggle_file} # user read/write
    drive.flush_and_unmount()
# Testing
isLocal = True # Using a local notebook or Kaggle
isColab = True # Control if the local environment is Colab
loadKeyFile = False # Control to download the Kaggle key file
competitionName = 'titanic'
dataDir = 'input/' if isLocal else '/kaggle/input/' + competitionName + '/'
workDir = 'working/' if isLocal else '/kaggle/working/'
if isLocal: # Creating the working folder when working locally
    ! mkdir -p {workDir}
# Getting kaggle.json file from Colab and putting it in the correct location
if isColab and loadKeyFile:
    getKeyFileFromColab()
# downloading competition files from Kaggle
downloadInputData(competitionName=competitionName, dataDir=dataDir)
The function getKeyFileFromColab just copies the kaggle.json file stored in Google Drive (in the Colab Notebooks folder) and puts it in the expected location. If you are not using Colab, you cannot invoke this function; instead, you need to put the file manually in the expected folder: ~/.kaggle, i.e. $HOME/.kaggle. The file only needs to be copied once within the same active session, which is why there is a separate function for it. You can control this step via the loadKeyFile control variable.
Once the setup is correct, we can download the competition files via the downloadInputData function.
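For anyone running outside a notebook, where the `!` shell magics are not available, the unzip-and-clean-up part of downloadInputData can be sketched with the standard library alone. This is only a rough equivalent, not the code from the answer above: the helper name extract_and_remove is made up for illustration, and unlike the original it only wipes the data folder once it knows the archive exists.

```python
import os
import shutil
import zipfile


def extract_and_remove(zip_path, data_dir):
    """Unzip zip_path into a fresh data_dir, then delete the archive.

    Hypothetical helper: returns the sorted member names,
    or None if the archive is missing.
    """
    if not os.path.exists(zip_path):
        print(f"Error: {zip_path} not found.")
        return None
    shutil.rmtree(data_dir, ignore_errors=True)  # same role as `rm -rf {dataDir}`
    os.makedirs(data_dir, exist_ok=True)         # same role as `mkdir -p {dataDir}`
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)
        names = sorted(archive.namelist())
    os.remove(zip_path)                          # same role as `rm {zipFile}`
    return names
```

After the Kaggle download has finished, something like extract_and_remove('titanic.zip', 'input/') would take the place of the last shell line.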
Upvotes: 0
Reputation: 6714
I have really struggled with the Kaggle API, so I use opendatasets instead. It is important to have your kaggle.json in the same folder as your notebook.
pip install opendatasets
import opendatasets as od
od.download("https://www.kaggle.com/competitions/tlvmc-parkinsons-freezing-gait-prediction/data","/mypath/goes/here")
Upvotes: 0
Reputation: 11
Full version of the Download_Kaggle_Dataset_To_Colab example, with explanations, that worked for me under Windows:
#Step1
#Input:
from google.colab import files
files.upload() # this will prompt you to upload kaggle.json. Download it from Kaggle > Kaggle API-file.json, save it to a folder on your PC, and choose it here
#Output Sample:
#kaggle.json
#kaggle.json(application/json) - 69 bytes, last modified: 29.06.2021 - 100% done
#Saving kaggle.json to kaggle.json
#{'kaggle.json':
#b'{"username":"sergeysukhov7","key":"23d4d4abdf3bee8ba88e653cec******"}'}
#Step2
#Input:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json # set permission
#Output:
#kaggle.json
#Step3
#Input:
#Set the environment variables
import os
os.environ['KAGGLE_USERNAME'] = "sergeysukhov7" #manually input My_Kaggle User_Name
os.environ['KAGGLE_KEY'] = "23d4d4abdf3bee8ba88e653cec5*****" #manually input My_Kaggle Key
#Step4
#!kaggle datasets download -d zillow/zecon # optional: downloads the dataset to the default folder, content/zecon.zip
#Find the Kaggle dataset link, for example https://www.kaggle.com/willkoehrsen/home-credit-default-risk-feature-tools, and take the part of the link after the domain: willkoehrsen/home-credit-default-risk-feature-tools
#Dataset identifier from Kaggle: willkoehrsen/home-credit-default-risk-feature-tools
#Colab download destination: /content/gdrive/My Drive/kaggle/credit/home-credit-default-risk-feature-tools.zip
!kaggle datasets download -d willkoehrsen/home-credit-default-risk-feature-tools -p /content/gdrive/My\ Drive/kaggle/credit
#Output
#Downloading home-credit-default-risk-feature-tools.zip to /content/gdrive/My Drive/kaggle/credit
#100% 3.63G/3.63G [01:31<00:00, 27.6MB/s]
#100% 3.63G/3.63G [01:31<00:00, 42.7MB/s]
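Step 2 above (mkdir, cp, chmod) can also be done from Python instead of shell commands. The following is a minimal sketch under that assumption; the function name install_kaggle_json and its parameters are made up for illustration and are not part of the kaggle package.

```python
import os
import shutil
import stat


def install_kaggle_json(src="kaggle.json", dest_dir=None):
    """Copy src into dest_dir (default: ~/.kaggle) and make it owner read/write only.

    Hypothetical helper mirroring the mkdir/cp/chmod shell commands.
    """
    if dest_dir is None:
        dest_dir = os.path.join(os.path.expanduser("~"), ".kaggle")
    os.makedirs(dest_dir, exist_ok=True)         # like `mkdir -p ~/.kaggle`
    dest = os.path.join(dest_dir, "kaggle.json")
    shutil.copyfile(src, dest)                   # like `cp kaggle.json ~/.kaggle/`
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)  # like `chmod 600`
    return dest
```

Calling install_kaggle_json() right after files.upload() should leave the token where the CLI expects it.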
Upvotes: 1
Reputation: 46291
Before anything:
pip install kaggle
For the dataset:
import os
os.environ['KAGGLE_USERNAME'] = "uname" # username from the json file
os.environ['KAGGLE_KEY'] = "kaggle_key" # key from the json file
!kaggle datasets download -d zynicide/wine-reviews
For the competitions:
import os
os.environ['KAGGLE_USERNAME'] = "uname" # username from the json file
os.environ['KAGGLE_KEY'] = "kaggle_key" # key from the json file
!kaggle competitions download -c dogs-vs-cats-redux-kernels-edition
Some time ago I provided another similar answer.
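If you would rather not paste the username and key into the notebook, the two environment variables can be filled from the kaggle.json file itself. A small sketch, assuming the file has the "username" and "key" fields of a standard Kaggle token; load_kaggle_env is a made-up helper name.

```python
import json
import os


def load_kaggle_env(json_path):
    """Read username/key from a kaggle.json file and export them as env variables.

    Hypothetical helper; returns the username so the caller can confirm what was loaded.
    """
    with open(json_path) as fh:
        creds = json.load(fh)
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]
```

After load_kaggle_env("kaggle.json"), the !kaggle commands above can be run unchanged.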
Upvotes: 2
Reputation: 1721
Ref https://github.com/Kaggle/kaggle-api
Step 1: Try installing Kaggle
pip install kaggle # Windows
pip install --user kaggle # Mac/Linux
Step 2:
Update your credentials so that kaggle can authenticate using .kaggle/kaggle.json,
based on the token generated from Kaggle.
ref: https://medium.com/@ankushchoubey/how-to-download-dataset-from-kaggle-7f700d7f9198
Step 3:
Now, instead of kaggle competitions download ..
run ~/.local/bin/kaggle competitions download ..
to avoid the "Command Kaggle Not Found" error.
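To check from Python which of the two forms in Step 3 applies on a given machine, one option is to look for the executable on PATH first and fall back to ~/.local/bin. This is only a sketch; find_kaggle_executable and its fallback_dir parameter are names invented here.

```python
import os
import shutil


def find_kaggle_executable(fallback_dir=None):
    """Return the path of the kaggle CLI, or None if it cannot be found.

    Hypothetical helper: checks PATH first, then fallback_dir
    (default: ~/.local/bin, where `pip install --user` puts scripts on Linux).
    """
    found = shutil.which("kaggle")
    if found:
        return found
    if fallback_dir is None:
        fallback_dir = os.path.join(os.path.expanduser("~"), ".local", "bin")
    candidate = os.path.join(fallback_dir, "kaggle")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

If this returns None, the package is most likely not installed for the current user.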
Upvotes: -3
Reputation: 101
The Kaggle API key and username are available on your Kaggle profile page, and the dataset download link is available on the dataset details page on Kaggle.
#Set the environment variables
import os
os.environ['KAGGLE_USERNAME'] = "xxxx"
os.environ['KAGGLE_KEY'] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
!kaggle competitions download -c dogs-vs-cats-redux-kernels-edition
Upvotes: 10
Reputation: 711
Basically, if you want to use the Kaggle Python API (the solution provided by @minh-triet is for the command line, not for Python), you have to do the following:
import kaggle
kaggle.api.authenticate()
kaggle.api.dataset_download_files('The_name_of_the_dataset', path='the_path_you_want_to_download_the_files_to', unzip=True)
I hope this helps.
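For illustration, here is a small wrapper around those three calls. It assumes the dataset name is given as owner/dataset-name (for example zillow/zecon) and rejects anything else before contacting Kaggle. The helper names split_dataset_slug and download_dataset are made up; only the kaggle.api calls come from the snippet above.

```python
def split_dataset_slug(slug):
    """Split 'owner/dataset-name' into (owner, name); raise ValueError otherwise."""
    parts = slug.strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'owner/dataset-name', got {slug!r}")
    return parts[0], parts[1]


def download_dataset(slug, path="."):
    """Download and unzip a dataset via the kaggle package (needs valid credentials)."""
    split_dataset_slug(slug)  # fail early on a malformed slug
    # Imported lazily so that split_dataset_slug works without the package installed.
    import kaggle
    kaggle.api.authenticate()
    kaggle.api.dataset_download_files(slug, path=path, unzip=True)
```

For example, download_dataset("zillow/zecon", path="data") should place the unzipped files under data/, provided kaggle.json is set up.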
Upvotes: 28
Reputation: 433
Just to make things easy for the next person, I combined the fantastic answer from CaitLAN Jenner with a little bit of code that takes the raw csv info and puts it into a Pandas DataFrame, assuming that row 0 has the column names. I used it to download the Pima Diabetes dataset from Kaggle, and it worked swimmingly.
I'm sure there are more elegant ways to do this, but it worked well enough for a class I was teaching, is easily interpretable, and lets you get to analysis with minimal fuss.
import pandas as pd
import requests
import csv
payload = {
'__RequestVerificationToken': '',
'username': 'username',
'password': 'password',
'rememberme': 'false'
}
loginURL = 'https://www.kaggle.com/account/login'
dataURL = "https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/diabetes.csv"
with requests.Session() as c:
    response = c.get(loginURL).text
    # extract the anti-forgery token embedded in the login page
    AFToken = response[response.index('antiForgeryToken')+19:response.index('isAnonymous: ')-12]
    #print("AntiForgeryToken={}".format(AFToken))
    payload['__RequestVerificationToken']=AFToken
    c.post(loginURL + "?isModal=true&returnUrl=/", data=payload)
    download = c.get(dataURL)
decoded_content = download.content.decode('utf-8')
cr = csv.reader(decoded_content.splitlines(), delimiter=',')
my_list = list(cr)
#for row in my_list:
#    print(row)
df = pd.DataFrame(my_list)
header = df.iloc[0]
df = df[1:]
# set_axis returns a new DataFrame; the inplace argument was removed in pandas 2.0
diab = df.set_axis(header, axis='columns')
# to make sure it worked, uncomment this next line:
# diab
Upvotes: -3
Reputation: 1250
I would recommend checking out the Kaggle API instead of writing your own code. As of the latest version, an example command to download a dataset is:
kaggle datasets download -d zillow/zecon
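If you need to run that command from a Python script rather than a shell, one way is to go through subprocess. This is a sketch only: it assumes the kaggle CLI is installed and on PATH, the function names are invented here, and it uses the CLI's -p (destination folder) and --unzip options.

```python
import subprocess


def build_download_command(dataset, dest=None, unzip=False):
    """Build the argument list for `kaggle datasets download` (hypothetical helper)."""
    cmd = ["kaggle", "datasets", "download", "-d", dataset]
    if dest is not None:
        cmd += ["-p", dest]      # destination folder
    if unzip:
        cmd.append("--unzip")    # extract the archive after download
    return cmd


def download(dataset, dest=None, unzip=False):
    """Run the CLI; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_download_command(dataset, dest, unzip), check=True)
```

Passing the arguments as a list avoids shell quoting problems with folders such as "My Drive".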
Upvotes: 6