Reputation: 7117
With the following method I'm able to list all files from my Google Drive account:
def listAllFiles(self):
    result = []
    page_token = None
    while True:
        try:
            param = {"q": "trashed=false", "orderBy": "createdTime"}
            if page_token:
                param['pageToken'] = page_token
            files = self.service.files().list(**param).execute()
            result.extend(files["files"])
            page_token = files.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            print('An error occurred:', error)
            break  # exit with whatever has been collected so far
    return result
For a better runtime I would like to return a generator from this method. I'm pretty new to Python, so I don't know how to do this.
The execute method of the files service returns at most 100 items per call, and if the response also contains a nextPageToken there are more items to fetch. It would be great if I could iterate over the generator to get the already fetched items while the next items are fetched from the service in the meantime. I hope you understand what I mean...
Is this possible? How do I have to rewrite this method to get the described functionality?
Upvotes: 2
Views: 2465
Reputation: 6318
You can rewrite your function to act as a generator by simply yielding the files one at a time.
Untested:
def listAllFiles(self):
    page_token = None
    while True:
        try:
            param = {"q": "trashed=false", "orderBy": "createdTime"}
            if page_token:
                param['pageToken'] = page_token
            files = self.service.files().list(**param).execute()
            # call a future to load the next bunch of files here!
            for f in files["files"]:
                yield f
            page_token = files.get('nextPageToken')
            if not page_token:
                break
        except errors.HttpError as error:
            print('An error occurred:', error)
            break
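Consumption then stays lazy. Assuming an instance drive of your class (the name is only for illustration; name is a standard field of a Drive v3 file resource):

for f in drive.listAllFiles():
    print(f["name"])  # each page of up to 100 files is fetched on demand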
If you do not parallelize any further, use chapelo's answer instead: yielding the whole list at once lets the consumer work through a complete page while the generator already prepares the next one.
Now, you are still not loading the next bunch of files concurrently. For this, as mentioned in the code above, you can submit a future that already gathers the next list of files in the background. When a yielded item is consumed (and your function continues to execute), you check the future to see whether the result has arrived. If not, you have to wait (as before) until it does.
As I don't have your code available, I cannot say whether this works (or is even syntactically correct), but you can use it as a starting point:
import concurrent.futures

def load_next_page(self, page_token=None):
    param = {"q": "trashed=false", "orderBy": "createdTime"}
    if page_token:
        param['pageToken'] = page_token
    result = None
    try:
        files = self.service.files().list(**param).execute()
        result = (files.get('nextPageToken'), files["files"])
    except errors.HttpError as error:
        print('An error occurred:', error)
    return result
def listAllFiles(self):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(self.load_next_page)
        while future:
            try:
                result = future.result(timeout=60)  # wait at most 60 s for the page
                future = None
                if not result:
                    break
                (next_page_token, files) = result
            except Exception as error:
                print('An error occurred:', error)
                break
            if next_page_token:
                # already start fetching the next page in the background
                future = executor.submit(self.load_next_page, next_page_token)
            # yield from files  (Python 3.3+)
            for f in files:
                yield f
Another option, as also mentioned in the comments, is to use a Queue. You can modify your function to return a queue that is filled by a thread spawned by the function. This should be faster than preloading only the next list, but it also means a higher implementation overhead.
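Untested as well, but a minimal sketch of that queue variant could look like this, reusing the load_next_page helper from above; the None put into the queue at the end is a sentinel that marks the end of the stream:

import queue
import threading

def listAllFiles(self):
    page_queue = queue.Queue(maxsize=4)  # bounds how far the producer runs ahead

    def producer():
        page_token = None
        while True:
            result = self.load_next_page(page_token)
            if not result:
                break  # an HttpError was already reported by load_next_page
            page_token, files = result
            page_queue.put(files)  # blocks while the queue is full
            if not page_token:
                break  # no continuation token: that was the last page
        page_queue.put(None)  # sentinel: end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        files = page_queue.get()
        if files is None:
            return
        for f in files:
            yield f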
Personally, I would recommend going with the futures approach, if the performance is adequate.
Upvotes: 4
Reputation: 2562
If you yield one file at a time, you block the generator between items. But if you yield the whole list the generator has prepared, then while you process that list of files the generator will have the next list ready for you:
Instead of Michael's suggestion

for f in files["files"]:
    yield f
Try to yield the whole list at once, and process the whole list of files when you receive it:
yield files["files"]
Consider this simple example:
from string import ascii_uppercase as letters, digits

lst_of_lsts = [[l+d for d in digits] for l in letters]

def get_a_list(list_of_lists):
    for lst in list_of_lists:
        yield lst  # the whole list, not each element at a time

gen = get_a_list(lst_of_lsts)

print(next(gen))  # ['A0', 'A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9']
print(next(gen))  # ['B0', 'B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8', 'B9']
print(next(gen))  # ['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9']
# And so on...
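Applied to the Drive listing, each iteration then hands you a whole page of up to 100 file resources. If you still want single files on the consumer side, flattening is cheap (handle_page below is a hypothetical stand-in for your own processing):

from itertools import chain

for page in self.listAllFiles():  # one list of up to 100 files per iteration
    handle_page(page)             # hypothetical page-level handler

# or flatten back to single files:
for f in chain.from_iterable(self.listAllFiles()):
    print(f["name"])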
Upvotes: 1
Reputation: 4445
You're going to have to change the flow of your script. Instead of returning all the files at once, you need to yield individual files. This also allows you to handle the fetching of results in the background.
Edit: The fetching of subsequent results would be transparent to the calling function; it would simply appear to take a bit longer. Essentially, once all files in the current list have been yielded to the calling function, you would fetch the next list and start yielding from it, repeating until there are no more files to list from Google Drive.
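As a rough sketch of that flow (fetch_page is a hypothetical helper that returns one page of files plus the token for the next one):

def listAllFiles(self):
    page_token = None
    while True:
        files, page_token = self.fetch_page(page_token)  # hypothetical helper
        for f in files:
            yield f  # the caller resumes here; nothing further is fetched yet
        if not page_token:
            return  # no continuation token means we are done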
I highly suggest reading What does the "yield" keyword do in Python? to understand the concept behind generators and the yield statement.
Upvotes: 0