DavidRguez
DavidRguez

Reputation: 1110

Memory error using openpyxl and large data excels

I have written a script which has to read lot of excel files from a folder (around 10,000). This script loads the excel file (some of them has more than 2,000 rows) and reads one column to count the number of rows (checking stuff). If the number of rows is not equal to a given number, it writes the warning in a log.

The problem comes when the script reads more than 1,000 excel files. It's then when it throws memory error, and I don't know where could be the problem. Previously, the script read two csv file with 14,000 rows and stores it in a list. These lists contain an identificator for the excel file and its respective number of rows. If this number of rows is not equal to the number of rows of the excel file, it writes the warning. Could be the problem the reading of these lists?

I'm using openpyxl to load the workbooks, do I need to close them before open the next?

This is my code:

# -*- coding: utf-8 -*-

import os
from openpyxl import Workbook
import glob
import time
import csv
from time import gmtime,strftime
from openpyxl import load_workbook

folder = ''
conditions = 0
a = 0
flight_error = 0
condition_error = 0
typical_flight_error = 0
SP_error = 0


cond_numbers = []
with open('Conditions.csv','rb') as csv_name:           # Abre el fichero csv donde estarán las equivalencias   
    csv_read = csv.reader(csv_name,delimiter='\t')

    for reads in csv_read:
        cond_numbers.append(reads)

flight_TF = []
with open('vuelo-TF.csv','rb') as vuelo_TF:
    csv_read = csv.reader(vuelo_TF,delimiter=';')

    for reads in csv_read:
        flight_TF.append(reads)


excel_files = glob.glob('*.xlsx')

for excel in excel_files:
    print "Leyendo excel: "+excel

    wb = load_workbook(excel)
    ws = wb.get_sheet_by_name('Control System')
    flight = ws.cell('A7').value
    typical_flight = ws.cell('B7').value
    a = 0

    for row in range(6,ws.get_highest_row()):
        conditions = conditions + 1


        value_flight = int(ws.cell(row=row,column=0).value)
        value_TF = ws.cell(row=row,column=1).value
        value_SP = int(ws.cell(row=row,column=4).value)

        if value_flight == '':
            break

        if value_flight != flight:
            flight_error = 1                # Si no todos los flight numbers dentro del vuelo son iguales

        if value_TF != typical_flight:
            typical_flight_error = 2            # Si no todos los typical flight dentro del vuelo son iguales

        if value_SP != 100:
            SP_error = 1



    for cond in cond_numbers:
        if int(flight) == int(cond[0]):
            conds = int(cond[1])
            if conds != int(conditions):
                condition_error = 1         # Si el número de condiciones no se corresponde con el esperado

    for vuelo_TF in flight_TF:
        if int(vuelo_TF[0]) == int(flight):
            TF = vuelo_TF[1]
            if typical_flight != TF:
                typical_flight_error = 1        # Si el vuelo no coincide con el respectivo typical flight

    if flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+':  Los flight numbers del vuelo '+str(flight)+' no coinciden.\n'
        log.write(message)
        log.close()
        flight_error = 0

    if condition_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El número de condiciones del vuelo '+str(flight)+' no coincide. Condiciones esperadas: '+str(int(conds))+'. Condiciones obtenidas: '+str(int(conditions))+'.\n'
        log.write(message)
        log.close()
        condition_error = 0

    if typical_flight_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': El vuelo '+str(flight)+' no coincide con el typical flight. Typical flight respectivo: '+TF+'. Typical flight obtenido: '+typical_flight+'.\n'
        log.write(message)
        log.close() 
        typical_flight_error = 0

    if typical_flight_error == 2:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Los typical flight del vuelo '+str(flight)+' no son todos iguales.\n'
        log.write(message)
        log.close()
        typical_flight_error = 0

    if SP_error == 1:
        today = datetime.datetime.today()
        time = today.strftime(" %Y-%m-%d %H.%M.%S")
        log = open('log.txt','aw')
        message = time+': Hay algún Step Percentage del vuelo '+str(flight)+' menor que 100.\n'
        log.write(message)
        log.close()
        SP_error = 0

    conditions = 0

The if statements of the end are for checking and writing warning logs.

I'm using windows xp with 8 gb RAM and intel xeon w3505 (two cores, 2,53 GHz).

Upvotes: 12

Views: 24167

Answers (4)

Mike Kelly
Mike Kelly

Reputation: 1029

This approach worked for me, copying data from a SQLite DB into corresponding worksheets for each table Some of the tables have > 250,000 rows and I was running into a Memory Error from OpenPyXL. The trick is to incrementally save every 100K rows and then reopen the workbook - this seems to reduce memory usage. I do something very similar to what @sakiM is doing above. Here's part of my code that does this:

    row_num = 2   # row 1 previously populated with column names
    session = self.CreateDBSession()  # SQL Alchemy connection to SQLite
    for item in session.query(ormClass):
        col_num = 1
        for col_name in sorted(fieldsInDB):  # list of columns from the table being put into XL columns
            if col_name != "__mapper__":        # Something SQL Alchemy apparently adds...
                val = getattr(item, col_name)
                sheet.cell(row=row_num, column=col_num).value = val
                col_num += 1
        row_num += 1
        if row_num % self.MAX_ROW_CHUNK == 0:   # MAX_ROW_CHUNK = 100000 
            self.WriteChunk()

# Write this chunk and reload the workbook to work around OpenPyXL memory issues
def WriteChunk(self):
    print("Incremental save of %s" % self.XLSPath)
    self.SaveXLWorkbook()
    print("Reopening %s" % self.XLSPath)
    self.OpenXLWorkbook()

# Open the XL Workbook we are updating
def OpenXLWorkbook(self):
    if not self.workbook:
        self.workbook = openpyxl.load_workbook(self.XLSPath)
    return self.workbook

# Save the workbook
def SaveXLWorkbook(self):
    if self.workbook:
        self.workbook.save(self.XLSPath)
        self.workbook = None

Upvotes: 2

sakiM
sakiM

Reputation: 4932

As @anuragal said

openpyxl will store all the accessed cells into memory

Another way to handle this huge memory problem while looping every cell is Divide-and-conquer. The point is after reading enough cell, save the excel by wb.save(), then the past values will be removed from memory.

checkPointLine = 100 # choose a better number in your case.

excel = openpyxl.load_workbook(excelPath,data_only= True)
ws = excel.active
readingLine = 1

for rowNum in range(readingLine,max_row):
    row = ws[rowNum]
    first = row[0]
    currentRow = first.row
    #doing the things to this line content then mark `isDirty = True`

    if currentRow%checkPointLine == 0:
        if isDirty:
            #write back only changed content
            excel.save(excelPath)
            isDirty = False
        excel = openpyxl.load_workbook(excelPath)
        ws = excel.active
    readingLine = first.row

Upvotes: 4

Dmitriy Sintsov
Dmitriy Sintsov

Reputation: 4159

With recent versions of openpyxl one has to load and read huge source workbook with read_only=True argument, and create / write huge destination workbook with write_only=True mode:

https://openpyxl.readthedocs.io/en/latest/optimized.html

Upvotes: 6

Anurag
Anurag

Reputation: 3114

The default implementation of openpyxl will store all the accessed cells into memory. I will suggest you to use the Optimized reader (link - https://openpyxl.readthedocs.org/en/latest/optimized.html) instead

In code:-

wb = load_workbook(file_path, use_iterators = True)

While loading a workbook pass use_iterators = True. Then access the sheet and cells like:

for row in sheet.iter_rows():
    for cell in row:
        cell_text = cell.value

This will reduce the memory footprint to 5-10%

UPDATE: In version 2.4.0 use_iterators = True option is removed. In newer versions openpyxl.writer.write_only.WriteOnlyWorksheet is introduced for dumping large amounts of data.

from openpyxl import Workbook
wb = Workbook(write_only=True)
ws = wb.create_sheet()

# now we'll fill it with 100 rows x 200 columns
for irow in range(100):
    ws.append(['%d' % i for i in range(200)])

# save the file
wb.save('new_big_file.xlsx') 

Not tested the below code just copied from the above link.

Thanks @SdaliM for the information.

Upvotes: 11

Related Questions