Reputation: 7117
With the following method I'm able to list all files from my Google Drive account:
def listAllFiles(self):
    result = []
    page_token = None
    while True:
        try:
            param = {"q": "trashed=false", "orderBy": "createdTime"}
            if page_token:
                param['pageToken'] = page_token
            files = self.service.files().list(**param).execute()
            result.extend(files["files"])
            page_token = files.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            print('An error occurred:', error)
            break  # exit with whatever has been collected so far
    return result
For a better runtime I would like to return a generator from this method. I'm pretty new to Python, so I don't know how to do this.
The execute method of the files service returns at most 100 items per call, and if the response also contains a nextPageToken there are more items to fetch. It would be great if I could iterate over the generator to get the already fetched items while the next items are fetched from the service in the meantime. I hope you understand what I mean...
Is this possible? How do I have to rewrite this method to get the described functionality?
Upvotes: 2
Views: 2465
Reputation: 6318
You can rewrite your function to act as a generator by simply yielding the files one at a time.
Untested:
def listAllFiles(self):
    page_token = None
    while True:
        try:
            param = {"q": "trashed=false", "orderBy": "createdTime"}
            if page_token:
                param['pageToken'] = page_token
            files = self.service.files().list(**param).execute()
            # call a future to load the next bunch of files here!
            for f in files["files"]:
                yield f
            page_token = files.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            print('An error occurred:', error)
            break
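Consumption then stays lazy. Assuming an instance drive of your class (the name is only for illustration; name is a standard field of a Drive v3 file resource):

for f in drive.listAllFiles():
    print(f["name"])  # each page of up to 100 files is fetched on demand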
If you do not parallelize any further, use chapelo's answer instead: yielding the whole list at once lets the consumer work through a complete page while the generator already prepares the next one.
Now, you are still not loading the next bunch of files concurrently. For this, as mentioned in the code above, you can submit a future that already gathers the next list of files in the background. When a yielded item is consumed (and your function continues to execute), you check the future to see whether the result has arrived. If not, you have to wait (as before) until it does.
As I don't have your code available, I cannot say whether this works (or is even syntactically correct), but you can use it as a starting point:
import concurrent.futures

def load_next_page(self, page_token=None):
    param = {"q": "trashed=false", "orderBy": "createdTime"}
    if page_token:
        param['pageToken'] = page_token
    result = None
    try:
        files = self.service.files().list(**param).execute()
        result = (files.get('nextPageToken'), files["files"])
    except errors.HttpError as error:
        print('An error occurred:', error)
    return result
def listAllFiles(self):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(self.load_next_page)
        while future:
            try:
                result = future.result(timeout=60)  # wait at most 60 s for the page
                future = None
                if not result:
                    break
                (next_page_token, files) = result
            except Exception as error:
                print('An error occurred:', error)
                break
            if next_page_token:
                # already start fetching the next page in the background
                future = executor.submit(self.load_next_page, next_page_token)
            # yield from files  (Python 3.3+)
            for f in files:
                yield f
Another option, as also mentioned in the comments, is to use a Queue. You can modify your function to return a queue that is filled by a thread spawned by the function. This should be faster than preloading only the next list, but it also means a higher implementation overhead.
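Untested as well, but a minimal sketch of that queue variant could look like this, reusing the load_next_page helper from above; the None put into the queue at the end is a sentinel that marks the end of the stream:

import queue
import threading

def listAllFiles(self):
    page_queue = queue.Queue(maxsize=4)  # bounds how far the producer runs ahead

    def producer():
        page_token = None
        while True:
            result = self.load_next_page(page_token)
            if not result:
                break  # an HttpError was already reported by load_next_page
            page_token, files = result
            page_queue.put(files)  # blocks while the queue is full
            if not page_token:
                break  # no continuation token: that was the last page
        page_queue.put(None)  # sentinel: end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        files = page_queue.get()
        if files is None:
            return
        for f in files:
            yield f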
Personally, I would recommend going with the futures approach, if the performance is adequate.
Upvotes: 4
Reputation: 2562
If you yield one file at a time, you block the generator between items. But if you yield the whole list the generator has prepared, then while you process that list of files the generator will have the next list ready for you:
Instead of Michael's suggestion

for f in files["files"]:
    yield f
Try to yield the whole list at once, and process the whole list of files when you receive it:
yield files["files"]
Consider this simple example:
from string import ascii_uppercase as letters, digits

lst_of_lsts = [[l+d for d in digits] for l in letters]

def get_a_list(list_of_lists):
    for lst in list_of_lists:
        yield lst  # the whole list, not each element at a time

gen = get_a_list(lst_of_lsts)

print(next(gen))  # ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9']
print(next(gen))  # ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9']
print(next(gen))  # ['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9']
# And so on...
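Applied to the Drive listing, each iteration then hands you a whole page of up to 100 file resources. If you still want single files on the consumer side, flattening is cheap (handle_page below is a hypothetical stand-in for your own processing):

from itertools import chain

for page in self.listAllFiles():  # one list of up to 100 files per iteration
    handle_page(page)             # hypothetical page-level handler

# or flatten back to single files:
for f in chain.from_iterable(self.listAllFiles()):
    print(f["name"])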
Upvotes: 1
Reputation: 4445
You're going to have to change the flow of your script. Instead of returning all the files at once, you need to yield individual files. This also allows you to handle the fetching of results in the background.
Edit: The fetching of subsequent results would be transparent to the calling function; it would simply appear to take a bit longer. Essentially, once all files in the current list have been yielded to the calling function, you would fetch the next list and start yielding from it, repeating until there are no more files to list from Google Drive.
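As a rough sketch of that flow (fetch_page is a hypothetical helper that returns one page of files plus the token for the next one):

def listAllFiles(self):
    page_token = None
    while True:
        files, page_token = self.fetch_page(page_token)  # hypothetical helper
        for f in files:
            yield f  # the caller resumes here; nothing further is fetched yet
        if not page_token:
            return  # no continuation token means we are done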
I highly suggest reading What does the "yield" keyword do in Python? to understand the concept behind generators and the yield statement.
Upvotes: 0