kiriloff

Reputation: 26333

How to download a file from Kaggle and work on it in python

I want to download a dataset from Kaggle in Python and then work on it. When I click the download button on, say, this page https://www.kaggle.com/quora/question-pairs-dataset my browser downloads a zip file. However, if I run this in Python:

import io
import os
import zipfile

import requests

r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(os.getcwd())

I get the error

Traceback (most recent call last):
  File "C:\Users\u\Documents\Maths\Code\Variational Auto Encoders\VAE_text_generation.py", line 33, in <module>
    z = zipfile.ZipFile(io.BytesIO(r.content))
  File "C:\Users\u\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
    self._RealGetContents()
  File "C:\Users\u\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Is there a way to fetch the file from the URL, unzip it if needed, and get the name of the downloaded file so that I can work on it?

EDIT: I also tested the following, which does not raise an error, but the resulting files are not the .csv and .txt I expect, probably because they were not unzipped or could not be downloaded correctly because access rights were missing.

import os
from urllib.request import urlretrieve

def report(num, size, total):
    # Progress hook: blocks fetched so far * block size, out of the total size
    print(num * size, '/', total)

# get kaggle data
urlretrieve("https://www.kaggle.com/quora/question-pairs-dataset/download",
    "train.csv", reporthook=report)
# os.path.join inserts the path separator that plain '+' concatenation omits
TRAIN_DATA_FILE = os.path.join(os.getcwd(), 'train.csv')

# get glove
urlretrieve("https://www.kaggle.com/watts2/glove6b50dtxt/download",
    "glove.txt", reporthook=report)
GLOVE_EMBEDDING = os.path.join(os.getcwd(), 'glove.txt')

Upvotes: 1

Views: 4109

Answers (2)

Daweo

Reputation: 36735

zipfile.BadZipFile: File is not a zip file

Clearly what you got is not a ZIP file. The Content-Type response header is useful for determining what you actually received. I ran:

import requests
r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
print(r.headers['Content-Type'])

output

text/html; charset=utf-8

So this is an HTML page. Since your browser does load a zip file, I suspect that access to this resource requires being logged in; otherwise you are redirected to a login page. To make requests-based downloading work, you would need to find out how Kaggle performs this check and conform to it.
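To fail early with a clearer message than BadZipFile, the payload can be checked before extraction. The following is only a minimal sketch: extract_if_zip is a hypothetical helper name, not part of requests or zipfile, and it assumes you pass it the Content-Type header and the raw response body. It does not solve the login problem; it only reports it clearly and returns the extracted file names when the body really is a ZIP.

```python
import io
import zipfile


def extract_if_zip(content_type, body, dest_dir):
    """Extract body into dest_dir if it is a ZIP archive.

    Hypothetical helper: returns the list of member names in the archive,
    or raises ValueError when the payload is not a ZIP (for example an
    HTML login page).
    """
    buffer = io.BytesIO(body)
    # is_zipfile accepts a file-like object and looks for the ZIP magic data
    if not zipfile.is_zipfile(buffer):
        raise ValueError(
            f"expected a ZIP archive, got Content-Type {content_type!r}"
        )
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

With the response r from above, this would be called as extract_if_zip(r.headers.get("Content-Type"), r.content, os.getcwd()). For the URL in the question it would raise ValueError as long as the server keeps answering with the HTML page.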

Upvotes: 2

Tikhon Petrishchev

Reputation: 304

First, try setting response.raw.decode_content = True, like so:

import io
import os
import zipfile

import requests

response = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
response.raw.decode_content = True
z = zipfile.ZipFile(io.BytesIO(response.content))
z.extractall(os.getcwd())

Upvotes: 0
