Reputation: 1000
I have two google spreadsheets:
QC- many columns, I want to check if a value from column 4 appears in the second spreadsheet lastEdited_PEID; if it does, it would put 'Bingo!' in column 14 of the same row where the value was found
lastEdited- one column, long spreadsheets of values
I achieve that with the following code:
#acces the documents on Drive
QC = gc.open_by_key("FIRST KEY").sheet1
lastEdited = gc.open_by_key("SECOND KEY").sheet1
#get values from columns and convert to lists
QC_PEID = QC.col_values(4)
lastEdited_PEID = lastEdited.col_values(1)
#iterate by rows and check if value from each row appears in the second document
for value in QC_PEID:
ind = QC_PEID.index(value)
if value in lastEdited_PEID:
QC.update_cell(ind, 14, 'Bingo!')
So it does the job but does it very slowly (about 5 minutes). I am concerned about the speed because I have to perform the operation for about 50 spreadsheets (avg. 6000 rows each).
I tried to remove the element from the second list when found (it can only appear once) with the following code in the loop:
for value in QC_PEID:
ind = QC_PEID.index(value)
if value in lastEdited_PEID:
QC.update_cell(ind, 14, 'Bingo!')
**lastEdited_PEID.remove('value')**
I thought it would make it faster as the reference list would be shorter but surprisingly it takes even more.
What could I do to make the process quicker?
Upvotes: 2
Views: 2687
Reputation: 3321
Since gspread is a wrapper around the Google Sheet's REST API each operation you perform on a spreadsheet renders to an HTTP request to the API. Most of the time this is the slowest part of the code. If you want to improve performance you need to figure out how to reduce the number of interactions with the API.
In your code sample each col_values()
call makes a single HTTP request. This is good. But then, when you iterating over cells values, there's an update_cell()
in a loop:
for value in QC_PEID:
ind = QC_PEID.index(value)
if value in lastEdited_PEID:
QC.update_cell(ind, 14, 'Bingo!') # it makes 2 HTTP requests each time
update_cell
makes two HTTP requests to the API (one to retrieve information needed to update the cell and another to actually send the update to the API.) You need to avoid this method call in your loop.
A better idea is to collect all updates and send them in a batch. This is what update_cells()
method is for.
update_cells()
needs a list of Cell
objects to do the batch update. You can get those by calling Worksheet.range()
.
This is what comes in into my mind:
# A utility method
def col_cells(worksheet, col):
"""Returns a range of cells in a `worksheet`'s column `col`."""
start_cell = self.get_addr_int(1, col)
end_cell = self.get_addr_int(worksheet.row_count, col)
return worksheet.range('%s:%s' % (start_cell, end_cell))
QC_PEID = QC.col_values(4)
lastEdited_PEID = set(lastEdited.col_cells(1)) # make the 'in' lookup a bit faster
column_14_cells = col_cells(QC, 14)
has_updates = False
# iterate by rows and check if value from each row appears in the second document
for i, value in enumerate(QC_PEID):
if value in lastEdited_PEID:
has_updates = True
column_14_cells[i].value = 'Bingo!'
if has_updates:
QC.update_cells(column_14_cells)
I didn't run the code. Beware of typos.
Upvotes: 5