user3541631

Reputation: 4008

Increase the speed of Excel file operations (using openpyxl): check a value and delete the row if a condition is met

I have a medium-sized Excel file, with about 25000 rows.

In the Excel file I check if a specific column value is in a list, and if it is, I delete the row.

I'm using openpyxl.

The code:

    count = 1
    while count <= ws.max_row:
        if ws.cell(row=count, column=2).value in remove_list:
            ws.delete_rows(count, 1)
        else:
            count += 1
    wb.save(src)

The code works, but it is very slow (it takes hours) to finish.

I know that openpyxl has read-only and write-only modes, but in my case I need both: first reading to check the value, then writing to delete the row.
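For what it's worth, the two modes can be combined in a single pass: stream the rows from the source workbook in read-only mode, skip the unwanted ones, and append the survivors to a new write-only workbook. A minimal sketch (the function name, file paths and key column are assumptions, and it writes a new file rather than editing in place):

```python
from openpyxl import Workbook, load_workbook

def filter_rows(src, dst, remove_list, key_col=2):
    """Stream src in read-only mode, write surviving rows to dst in write-only mode."""
    remove_set = set(remove_list)                  # O(1) membership tests
    wb_in = load_workbook(src, read_only=True)
    ws_in = wb_in.active

    wb_out = Workbook(write_only=True)
    ws_out = wb_out.create_sheet()
    for row in ws_in.iter_rows(values_only=True):  # each row is a tuple of values
        if row[key_col - 1] not in remove_set:
            ws_out.append(row)

    wb_in.close()
    wb_out.save(dst)
```

Because nothing is ever deleted, only appended, this avoids the expensive row-shifting entirely.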

Upvotes: 1

Views: 2268

Answers (4)

Master Kraken

Reputation: 1

Adding on to ketdaddy's response: I tested it and noticed that when you use this sequence in a for loop as suggested, you need to update the row numbers on every iteration to account for the rows already deleted.

For example, by the second step of the loop the start row is no longer the original start row; it is the original start row minus the number of rows deleted so far.

This code will update ketdaddy's sequence to generate a sequence which takes this into account.

original_sequence = get_sequences(deleterows)
updated_sequence = []
cumdelete = 0
for start, delete in original_sequence:
    new_start = start - cumdelete
    cumdelete = cumdelete + delete
    updated_sequence.append([new_start, delete])
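A side note (not in the original answer): if the sequences are instead deleted from the bottom of the sheet upwards, earlier deletions never shift the rows above them, so no adjustment is needed at all. Simulated here on a plain Python list standing in for the sheet's rows:

```python
def delete_sequences_bottom_up(rows, sequences):
    """Apply [start_row, count] deletions to a 1-indexed row list, bottom-up."""
    # Highest start row first: each deletion only moves rows below it,
    # so the remaining start rows stay valid as-is.
    for start, count in sorted(sequences, reverse=True):
        del rows[start - 1:start - 1 + count]
    return rows

rows = delete_sequences_bottom_up(list(range(1, 51)), [[2, 7], [45, 4]])
# rows 2-8 and 45-48 are gone; the remaining 39 rows kept their values
```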

Upvotes: 0

GCru

Reputation: 516

When a number of rows have to be deleted from a sheet, I create a list of their row numbers (e.g. remove_list), then rewrite the sheet to a temporary sheet, excluding those rows. Finally I delete the original sheet and rename the temporary sheet to the original name. See my function for doing this below:

def delete_excel_rows_with_openpyxl(workbook, sheet, remove_list): 
    """ Delete rows with row numbers in remove_list from sheet contained in workbook """ 

    temp_sheet = workbook.create_sheet('TempSheet')

    destination_row_counter = 1
    for source_row_counter, source_row in enumerate(sheet.iter_rows(min_row=1, max_row=sheet.max_row)):

        try:
            i = remove_list.index(source_row_counter+1) # enumerate counts from 0 and sheet from 1
            # do not copy row
            del remove_list[i]
        except ValueError:
            # copy row
            column_count = 1
            for cell in source_row:
                temp_sheet.cell(row=destination_row_counter, column=column_count).value = cell.value
                column_count = column_count + 1

            destination_row_counter = destination_row_counter + 1

    sheet_title = sheet.title
    workbook.remove(sheet)  # remove_sheet() is deprecated in newer openpyxl
    temp_sheet.title = sheet_title

    return workbook, temp_sheet   

Upvotes: 0

ketdaddy

Reputation: 56

I see you are using a list of rows which you need to delete. Instead, you can create "sequences" of rows to delete, thus changing a delete list like [2,3,4,5,6,7,8,45,46,47,48] to one like [[2, 7],[45, 4]]

i.e. Delete 7 rows starting at row 2, then delete 4 rows starting at row 45

Deleting in bulk is faster than deleting one row at a time. I deleted 6k rows in around 10 seconds this way.

The following code will convert a list to a list of lists/sequences:

def get_sequences(list_of_ints):
    """Collapse sorted row numbers into [start, count] runs of consecutive values."""
    sequence_count = 1
    sequences = []
    for index, row in enumerate(list_of_ints):
        next_item = None
        if index < len(list_of_ints) - 1:
            next_item = list_of_ints[index + 1]

        if row + 1 == next_item:
            # still inside a consecutive run
            sequence_count += 1
        else:
            # run ended: record its first row and its length
            first_in_sequence = list_of_ints[index - sequence_count + 1]
            sequences.append([first_in_sequence, sequence_count])
            sequence_count = 1

    return sequences
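As a sanity check on the grouping step, an equivalent compact version using itertools.groupby (a standard trick for finding runs of consecutive integers: consecutive values share the same value-minus-index difference) produces the same [start, count] pairs; the function name here is my own:

```python
from itertools import groupby

def get_sequences_compact(rows):
    """Group sorted row numbers into [start, count] runs of consecutive values."""
    sequences = []
    # enumerate pairs each row with its position; consecutive rows
    # share the same row - index difference, so they group together.
    for _, run in groupby(enumerate(rows), key=lambda p: p[1] - p[0]):
        run = [value for _, value in run]
        sequences.append([run[0], len(run)])
    return sequences

print(get_sequences_compact([2, 3, 4, 5, 6, 7, 8, 45, 46, 47, 48]))
# [[2, 7], [45, 4]]
```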

Then run another loop to delete:

    for sequence in sequences:
        sheet.delete_rows(sequence[0], sequence[1])

Upvotes: 4

Pynchia

Reputation: 11596

Personally, I would do two things:

first, transform the list into a set, so that membership lookups take constant time instead of a linear scan

remove_set = set(remove_list)
...
if ws.cell(row=count, column=2).value in remove_set:

then I would avoid removing the rows in place, as it takes a lot of time to reorganise the data structures representing the sheet.

I would create a new blank worksheet and add to it only the rows which must be kept.

Then save the new worksheet, overwriting the original if you wish.
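A sketch of that keep-the-good-rows approach as a single function (the function name is mine; the key column defaults to column 2, as in the question):

```python
from openpyxl import load_workbook

def keep_rows(path, remove_list, key_col=2):
    """Rebuild the active sheet without the rows whose key_col value is in remove_list."""
    wb = load_workbook(path)
    ws = wb.active
    remove_set = set(remove_list)

    new_ws = wb.create_sheet('TempSheet')
    for row in ws.iter_rows(values_only=True):
        if row[key_col - 1] not in remove_set:
            new_ws.append(row)

    title = ws.title
    wb.remove(ws)            # drop the original sheet...
    new_ws.title = title     # ...and give the new one its name
    wb.save(path)
```

Note this overwrites the file in place; save to a different path first if you want to keep the original.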

If it still takes too long, consider using CSV format, so you can treat the input data as plain text and write the output the same way, re-importing the data into the spreadsheet program (e.g. MS Excel) later.

Have a look at the official docs and at this tutorial to find out how to use the csv module.
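On CSV data the same filtering becomes a plain text pass; a sketch using the standard csv module (function and file names are my own):

```python
import csv

def filter_csv(src, dst, remove_set, key_col=2):
    """Copy src to dst, dropping rows whose key_col value is in remove_set."""
    with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):         # row is a list of strings
            if row[key_col - 1] not in remove_set:
                writer.writerow(row)
```

Since each row is handled as a short list of strings, 25000 rows take a fraction of a second rather than hours.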

Further note: as spotted by @Charlie Clark, the calculation of

ws.max_row

may take some time as well, and there is no need to repeat it on every iteration.

To do that, the easiest solution is to work backwards from the last row to the first, so that deletions never shift the rows still to be checked.
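The backwards loop could look like this (wrapped in a function of my own naming; note ws.max_row is read only once, before the loop):

```python
def delete_matching_rows(ws, remove_set, key_col=2):
    """Delete matching rows in place, scanning the sheet from the bottom up."""
    max_row = ws.max_row              # computed once, outside the loop
    # Walking backwards means a deletion only shifts rows *below* the
    # current one, which have already been checked.
    for row in range(max_row, 0, -1):
        if ws.cell(row=row, column=key_col).value in remove_set:
            ws.delete_rows(row, 1)
```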

Upvotes: 1
