user14073111

Reputation: 621

List more than 10000 files in Google Drive with Python

I have a Google Drive folder that contains more than 10000 subfolders. I'm trying to list these subfolders using this code:

import pickle
import os.path
import io
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from numpy import cumproduct
import pandas as pd
import gdown
from pyasn1.type.constraint import ContainedSubtypeConstraint
import requests
from googleapiclient.http import MediaIoBaseDownload
import httplib2

SCOPES = ['https://www.googleapis.com/auth/drive']

creds = None
if os.path.exists('token.pickle'):
    with open('token.pickle', 'rb') as token:
        creds = pickle.load(token)
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
                'test.json', SCOPES)
        creds = flow.run_local_server(port=0)
    with open('token.pickle', 'wb') as token:
        pickle.dump(creds, token)

service = build('drive', 'v3', credentials=creds)

folder_id='valid folder id'
query=f"parents = '{folder_id}'"

response=service.files().list(q=query).execute()
files=response.get('files')
nextPageToken=response.get('nextPageToken')

while nextPageToken:
    response=service.files().list(q=query).execute()
    files.extend(response.get('files'))
    nextPageToken=response.get('nextPageToken')

df = pd.DataFrame(files)
print(df)

While debugging, I saw that the response contained only 100 subfolders. How can I modify this script to list all 10000+ subfolders?

Upvotes: 0

Views: 404

Answers (1)

Kristkun

Reputation: 5953

It seems you forgot to pass your nextPageToken value as the pageToken parameter of the files.list() request inside your while loop. Without it, every request returns the first page again.

It should be like this:

while nextPageToken:
    # Pass the token from the previous response so the API returns the next page.
    response=service.files().list(pageToken=nextPageToken, q=query).execute()
    files.extend(response.get('files', []))
    nextPageToken=response.get('nextPageToken')

You might also want to increase the pageSize parameter, which sets the maximum number of files returned per page. Acceptable values are 1 to 1000, inclusive (default: 100). See the files.list() parameters.
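To see why a larger pageSize helps, here is a rough back-of-the-envelope sketch. The helper name requests_needed is hypothetical and not part of the Drive API. It assumes every page comes back full, which the API does not guarantee (pageSize is only a maximum), so treat the result as a lower bound on the number of requests.

```python
import math


def requests_needed(total_items: int, page_size: int) -> int:
    """Minimum number of list requests, assuming every page is returned full."""
    return math.ceil(total_items / page_size)


# 10000 subfolders with the default page size versus the maximum page size.
print(requests_needed(10000, 100))
print(requests_needed(10000, 1000))
```

With the default of 100 this works out to at least 100 requests for 10000 items, and with pageSize=1000 to at least 10.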

Your code (with pageSize):

service = build('drive', 'v3', credentials=creds)

folder_id='valid folder id'
query=f"parents = '{folder_id}'"

response=service.files().list(pageSize=1000, q=query).execute()
files=response.get('files')
nextPageToken=response.get('nextPageToken')

while nextPageToken:
    response=service.files().list(pageSize=1000, pageToken=nextPageToken, q=query).execute()
    files.extend(response.get('files'))
    nextPageToken=response.get('nextPageToken')

Another sample implementation, using a single loop that starts with no page token:

service = build('drive', 'v3', credentials=creds)
    
folder_id='valid folder id'
query=f"parents = '{folder_id}'"
page_token = None
my_files = list()
while True:
    results = service.files().list(pageSize=1000, pageToken=page_token, q=query).execute()
    files = results.get('files', [])
    my_files.extend(files)
    page_token = results.get('nextPageToken', None)
    if page_token is None:
        break
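If you need this in more than one place, the same loop can be wrapped in a generator. This is only a sketch: iter_drive_files is a hypothetical helper name, and the FakeService classes below are an in-memory stand-in I made up so the pagination logic can be dry-run without credentials. Real Drive page tokens are opaque strings, not offsets as in the fake.

```python
from typing import Any, Dict, Iterator, Optional


def iter_drive_files(service: Any, query: str, page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every file matching `query`, following nextPageToken until it runs out."""
    page_token: Optional[str] = None
    while True:
        response = service.files().list(
            q=query, pageSize=page_size, pageToken=page_token
        ).execute()
        yield from response.get('files', [])
        page_token = response.get('nextPageToken')
        if not page_token:
            break


# Hypothetical stand-in for the Drive service, for a dry run only.
class FakeRequest:
    def __init__(self, payload):
        self._payload = payload

    def execute(self):
        return self._payload


class FakeFiles:
    def __init__(self, items):
        self._items = items

    def list(self, q=None, pageSize=100, pageToken=None):
        # The fake encodes the next offset as the token; real tokens are opaque.
        start = int(pageToken or 0)
        end = start + pageSize
        payload = {'files': self._items[start:end]}
        if end < len(self._items):
            payload['nextPageToken'] = str(end)
        return FakeRequest(payload)


class FakeService:
    def __init__(self, items):
        self._files = FakeFiles(items)

    def files(self):
        return self._files


fake = FakeService([{'id': str(i), 'name': f'folder_{i}'} for i in range(2500)])
all_files = list(iter_drive_files(fake, "parents = 'valid folder id'"))
print(len(all_files))
```

With the real client you would pass the service object returned by build('drive', 'v3', credentials=creds) instead of the fake, and pd.DataFrame(all_files) would then work as in the question.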

Upvotes: 1
