Vahid the Great

Reputation: 483

Downloading a Large Folder From Google Drive Using Python

I'm trying to download a large folder with 50,000 images from my Google Drive to a local server using Python. The following code fails with a limit error. Are there any alternative solutions?

import gdown
url = 'https://drive.google.com/drive/folders/135hTTURfjn43fo4f?usp=sharing'  # I'm showing a fake token
gdown.download_folder(url)

Failed to retrieve folder contents:

The gdrive folder with url: https://drive.google.com/drive/folders/135hTTURfjn43fo4f?usp=sharing has at least 50 files, gdrive can't download more than this limit, if you are ok with this, please run again with --remaining-ok flag.

Upvotes: 8

Views: 13833

Answers (5)

Olin

Reputation: 41

The download limit is set in ../gdown/download_folder.py

If you installed gdown in a virtual environment, simply edit the download_folder.py file located in .venv/lib/python3.*/site-packages/gdown/. Edit the line MAX_NUMBER_FILES = 50 and set the value to your new maximum.
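
If you'd rather not hunt for the path by hand, here is a minimal sketch (not part of the original answer, and assuming gdown is importable in the current environment and the constant is still named MAX_NUMBER_FILES) that patches the installed file in place:

import pathlib
import re

import gdown

# Locate the installed download_folder.py next to the gdown package itself.
path = pathlib.Path(gdown.__file__).parent / 'download_folder.py'
source = path.read_text()

# 10000 is an arbitrary example value; pick whatever maximum you need.
source = re.sub(r'MAX_NUMBER_FILES = \d+', 'MAX_NUMBER_FILES = 10000', source)
path.write_text(source)

Restart the interpreter (or the Colab runtime) afterwards so the patched module is re-imported.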

Upvotes: 4

Oleg Petrov

Reputation: 45

I was trying to download CORDv0 from Google Drive via the CLI, and there is no other good way to do it in one line. The best approach is to save the folder as a zip archive and then download it as a single file.
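
Once the folder is zipped, it is a single file, so the 50-file folder limit no longer applies. A minimal sketch of downloading that archive with gdown, assuming you know the archive's file ID (the ID below is a placeholder):

import gdown

# Replace with the file ID of the zipped archive on your Drive.
file_id = '<ARCHIVE_FILE_ID>'
gdown.download(f'https://drive.google.com/uc?id={file_id}', 'archive.zip', quiet=False)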

In some cases, changing the download limit can help. In Colab, I used:

!pip uninstall gdown --yes
!cd .. && git clone https://github.com/wkentaro/gdown

# Raise the folder download limit in the cloned source
with open('../gdown/gdown/download_folder.py', 'r') as f:
    code = f.read().replace('MAX_NUMBER_FILES = 50', 'MAX_NUMBER_FILES = 10000')

with open('../gdown/gdown/download_folder.py', 'w') as f:
    f.write(code)

# Reinstall gdown from the patched source
!cd ../gdown && pip install -e . --no-cache-dir
!pip show gdown

But please keep gdown's errors in mind. As mentioned above, the gdown library is not the best choice here.

Upvotes: 1

noobforever

Reputation: 29

This is a workaround I used to download URLs with gdown:

  • Go to the Drive folder from which you need to download the files.
  • Select all the files using Ctrl/Cmd+A, click on Share, and copy all the links.
  • Now use the following Python script to do the job:
import re
import os

urls = <copied_urls>  # paste the copied links here as a single comma-separated string
url_list = urls.split(', ')
pat = re.compile(r'https://drive\.google\.com/file/d/(.*)/view\?usp=sharing')
for url in url_list:
    match = re.match(pat, url)
    file_id = match.group(1)
    down_url = f'https://drive.google.com/uc?id={file_id}'
    os.system(f'gdown {down_url}')

Note: This solution isn't ideal for 50,000 images, as the copied-URLs string will be huge. If your string is that large, copy it into a file and process the file instead of using a variable, as sketched below. In my case I only had to copy 75 large files.
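
A minimal sketch of that file-based variant, assuming the copied links are saved one per line in a file (links.txt is a name chosen here for illustration):

import re
import subprocess

pat = re.compile(r'https://drive\.google\.com/file/d/(.*)/view\?usp=sharing')

# links.txt is a hypothetical file containing one shared link per line.
with open('links.txt') as f:
    for line in f:
        m = pat.match(line.strip())
        if m:
            file_id = m.group(1)
            subprocess.run(['gdown', f'https://drive.google.com/uc?id={file_id}'])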

Upvotes: 2

Mohamed Salah

Reputation: 93

!pip uninstall --yes gdown # After running this line, restart Colab runtime.
!pip install gdown -U --no-cache-dir
import gdown

url = r'https://drive.google.com/drive/folders/1sWD6urkwyZo8ZyZBJoJw40eKK0jDNEni'
gdown.download_folder(url)

Upvotes: -3

NightEye

Reputation: 11194

As kite mentioned in the comments, use it with the remaining_ok flag.

gdown.download_folder(url, remaining_ok=True)

This isn't mentioned at https://pypi.org/project/gdown/, so there might be some confusion.

No references to remaining_ok are available aside from the warning and this GitHub code.

EDIT:

It seems gdown is strictly limited to 50 files, and I haven't found a way of circumventing that.

If something other than gdown is an option, then see the code below.

Script:

import io
import os
import os.path
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from google.oauth2 import service_account

credential_json = {
    ### Create a service account and use its the json content here ###
    ### https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account
    ### credentials.json looks like this:
    "type": "service_account",
    "project_id": "*********",
    "private_key_id": "*********",
    "private_key": "-----BEGIN PRIVATE KEY-----\n*********\n-----END PRIVATE KEY-----\n",
    "client_email": "service-account@*********.iam.gserviceaccount.com",
    "client_id": "*********",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account%40*********.iam.gserviceaccount.com"
}

credentials = service_account.Credentials.from_service_account_info(credential_json)
drive_service = build('drive', 'v3', credentials=credentials)

folderId = '### Google Drive Folder ID ###'
outputFolder = 'output'

# Create folder if not existing
if not os.path.isdir(outputFolder):
    os.mkdir(outputFolder)

items = []
pageToken = ""
while pageToken is not None:
    response = drive_service.files().list(q="'" + folderId + "' in parents", pageSize=1000, pageToken=pageToken,
                                          fields="nextPageToken, files(id, name)").execute()
    items.extend(response.get('files', []))
    pageToken = response.get('nextPageToken')

for file in items:
    file_id = file['id']
    file_name = file['name']
    request = drive_service.files().get_media(fileId=file_id)
    ### Saves all files under outputFolder
    fh = io.FileIO(outputFolder + '/' + file_name, 'wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
    print(f'{file_name} downloaded completely.')

Upvotes: 2
