How to download a file from Kaggle and work on it in python

Question

I want to download a dataset from Kaggle in Python and then work on it. When I click the download button on, say, this page: https://www.kaggle.com/quora/question-pairs-dataset

my browser downloads a zip file. However, if I run the following in Python:

import io, os, zipfile
import requests
r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(os.getcwd())

I get the error

Traceback (most recent call last):
  File "C:\Users\u\Documents\Maths\Code\Variational Auto Encoders\VAE_text_generation.py", line 33, in <module>
    z = zipfile.ZipFile(io.BytesIO(r.content))
  File "C:\Users\u\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "C:\Users\u\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Is there a way to get the file from the URL, unzip it if needed, and get the name of the downloaded file so that I can work on it?

EDIT: I also tested the following, which does not produce an error, but the resulting files are not the .csv and .txt I expect, probably because they have not been unzipped or could not be downloaded correctly because of missing permissions.

import os
from urllib.request import urlretrieve

def report(num, size, total):
    # Progress hook: print bytes transferred so far out of the total
    print(num*size, '/', total)

# get kaggle data
urlretrieve("https://www.kaggle.com/quora/question-pairs-dataset/download",
    "train.csv", reporthook=report)
# os.path.join inserts the path separator that plain concatenation omitted
TRAIN_DATA_FILE = os.path.join(os.getcwd(), 'train.csv')

# get glove
urlretrieve("https://www.kaggle.com/watts2/glove6b50dtxt/download",
    "glove.txt", reporthook=report)
GLOVE_EMBEDDING = os.path.join(os.getcwd(), 'glove.txt')

Daweo · Accepted Answer

zipfile.BadZipFile: File is not a zip file

Clearly, what you got is not a ZIP file. The Content-Type response header is useful for determining what you actually received. I ran:

import requests
r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
print(r.headers['Content-Type'])

Output:

text/html; charset=utf-8

So this is an HTML page. Since you say "my browser is loading a zip file", I suspect that access to this resource requires being logged in; otherwise you are redirected to a login page. To make requests-based downloading work, you would need to find out how Kaggle performs this check and conform to it.
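Whatever way the authentication ends up being handled, it may help to check the payload before handing it to zipfile, so that a login page produces a readable error instead of BadZipFile. Below is a minimal sketch of that check. The function name extract_zip_bytes and its parameters are my own illustration, not part of requests or of any Kaggle tooling. It returns the paths of the extracted files, which also covers the "get the name of the downloaded file" part of the question.

```python
import io
import os
import zipfile


def extract_zip_bytes(content, content_type, dest_dir):
    """Extract ZIP bytes into dest_dir and return the extracted file paths.

    Hypothetical helper: raises ValueError when the payload is not a ZIP
    archive, for example when the server returned an HTML login page.
    """
    buffer = io.BytesIO(content)
    # is_zipfile accepts a file-like object and looks for the ZIP signature
    if not zipfile.is_zipfile(buffer):
        raise ValueError(
            f"Expected a ZIP archive but got Content-Type {content_type!r}; "
            "the server may have returned a login page instead."
        )
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(dest_dir)
        # namelist() gives the member names stored in the archive
        return [os.path.join(dest_dir, name) for name in archive.namelist()]
```

With a response r from requests.get, this would be called as extract_zip_bytes(r.content, r.headers.get("Content-Type"), os.getcwd()). For the unauthenticated request above it would raise the ValueError, since the body is HTML rather than a ZIP archive.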
