Reputation: 26333
I want to download a dataset from Kaggle in Python and then work on it. When I click the download button on a page such as https://www.kaggle.com/quora/question-pairs-dataset
my browser downloads a zip file. However, if I run the following in Python
import io, os, requests, zipfile
r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(os.getcwd())
I get the error
Traceback (most recent call last):
File "C:\Users\u\Documents\Maths\Code\Variational Auto Encoders\VAE_text_generation.py", line 33, in <module>
z = zipfile.ZipFile(io.BytesIO(r.content))
File "C:\Users\u\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
self._RealGetContents()
File "C:\Users\u\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Is there a way to get the file from the URL, unzip it if needed, and get the name of the downloaded file so that I can work on it?
EDIT: I also tested the following, which does not produce an error, but the resulting files are not the .csv and .txt I expect, probably because they were not unzipped or could not be downloaded correctly because I lacked the necessary rights.
import os
from urllib.request import urlretrieve

def report(num, size, total):
    # reporthook: print the bytes fetched so far against the reported total
    print(num*size, '/', total)

# get kaggle data
urlretrieve("https://www.kaggle.com/quora/question-pairs-dataset/download", "train.csv", reporthook=report)
TRAIN_DATA_FILE = os.path.join(os.getcwd(), 'train.csv')
## get glove
urlretrieve("https://www.kaggle.com/watts2/glove6b50dtxt/download",
            "glove.txt", reporthook=report)
GLOVE_EMBEDDING = os.path.join(os.getcwd(), 'glove.txt')
Upvotes: 1
Views: 4109
Reputation: 36735
zipfile.BadZipFile: File is not a zip file
Clearly what you got is not a ZIP file. The Content-Type response header is useful for determining what you did get. I ran:
import requests
r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
print(r.headers['Content-Type'])
output
text/html; charset=utf-8
So this is an HTML page. Since your browser does download a zip file, I suspect that access to this resource requires being logged in; otherwise you are redirected to a login page. To make requests-based downloading work, you would need to find out how Kaggle checks authentication and conform to it.
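To avoid hitting BadZipFile in the first place, you can check what you received before handing it to zipfile. Below is a minimal sketch of that idea, not Kaggle-specific: the helper name extract_zip_bytes is my own (hypothetical), and it assumes you already have the response body and its Content-Type header in hand. It also returns the names of the extracted files, which covers the "get the name of the downloaded file" part of the question.

```python
import io
import os
import zipfile


def extract_zip_bytes(body, content_type, dest_dir):
    """Extract `body` into `dest_dir` if it really is a ZIP archive.

    `extract_zip_bytes` is a hypothetical helper, not part of any library.
    Returns the member names in the archive, or raises ValueError that
    reports the Content-Type which was actually received.
    """
    # is_zipfile accepts a file-like object and checks the ZIP signature,
    # so an HTML login page is rejected here instead of inside ZipFile().
    if not zipfile.is_zipfile(io.BytesIO(body)):
        raise ValueError(f"not a ZIP archive (Content-Type: {content_type!r})")
    with zipfile.ZipFile(io.BytesIO(body)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

With requests, you would call it as extract_zip_bytes(r.content, r.headers.get("Content-Type"), os.getcwd()). For the URL in the question this should raise ValueError mentioning text/html until the request is authenticated.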
Upvotes: 2
Reputation: 304
First, try setting response.raw.decode_content = True, like so:
import io, os, requests, zipfile
response = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
response.raw.decode_content = True
z = zipfile.ZipFile(io.BytesIO(response.content))
z.extractall(os.getcwd())
Upvotes: 0