snurre
snurre

Reputation: 3105

Reading excel files with xlrd

I'm having problems reading .xls files written by a Perl script which I have no control over. The files contain some formatting and line breaks within cells.

filename = '/home/shared/testfile.xls'
book = xlrd.open_workbook(filename)
sheet = book.sheet_by_index(0)
for rowIndex in xrange(1, sheet.nrows):
    row = sheet.row(rowIndex)

This is throwing the following error:

_locate_stream(Workbook): seen
    0  5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
   20  4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172480= 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172500  4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 2
172520  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173840= 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173860  2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
173880  1 1 1 1 1 1 1 1
Traceback (most recent call last):
  File "/home/shared/xlrdtest.py", line 5, in <module>
    book = xlrd.open_workbook(filename)
  File "/usr/local/lib/python2.7/site-packages/xlrd/__init__.py", line 443, in open_workbook
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 84, in open_workbook_xls
    ragged_rows=ragged_rows,
  File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 616, in biff2_8_load
    self.mem, self.base, self.stream_len = cd.locate_named_stream(qname)
  File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 393, in locate_named_stream
    d.tot_size, qname, d.DID+6)
  File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 421, in _locate_stream
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4

I'm not able to find any info about CompDocError or Workbook corruption, even less the seen[2] == 4 part.

Upvotes: 19

Views: 25317

Answers (7)

pymen
pymen

Reputation: 6569

[For pandas users who have trouble with Workbook corruption: seen[2] == 4] when reading XLS files

NOTE: xlrd no longer supports anything other than xls files, see this answer for details and alternatives.

  1. Use the xlrd 2.0.0 or above: https://pypi.org/project/xlrd/.

  2. Add your own XLS engine for pandas in one of your __init__.py files which loads automatically

from pandas.io.excel._xlrd import XlrdReader

class CustomXlrdReader(XlrdReader):

    def load_workbook(self, filepath_or_buffer):
        """Same as original, just uses ignore_workbook_corruption=True)"""
        from xlrd import open_workbook

        if hasattr(filepath_or_buffer, "read"):
            data = filepath_or_buffer.read()
            return open_workbook(file_contents=data, ignore_workbook_corruption=True)
        else:
            return open_workbook(filepath_or_buffer)


ExcelFile._engines['custom_xlrd'] = CustomXlrdReader
print('Monkey patching pandas XLS engines. See CustomXlrdReader')

To use it, specify engine='custom_xlrd' when you read XLS file

df = pd.read_excel(filepath, engine='custom_xlrd')

Here's the corrupted file that xlrd is tested against:

https://github.com/python-excel/xlrd/blob/master/tests/samples/corrupted_error.xls

Upvotes: 7

Alfonso
Alfonso

Reputation: 301

Inspired by this answer, but a bit more straigthforward:

pd.read_excel accepts xlrd.Book objects so you could do:

workbook = xlrd.open_workbook('file_name.xls', ignore_workbook_corruption=True)
excel = pd.read_excel(workbook)

Upvotes: 13

user12354862
user12354862

Reputation: 1

testfile.xls -----> Save As ---> Format 97-2003 --->testfile2.xlc

filename = '/home/shared/testfile2.xls'

OK

Upvotes: -1

holymoly345
holymoly345

Reputation: 151

From pkm comment (link) the problem is with a Compound File Binary

#pip install OleFileIO-PL
import OleFileIO_PL
import pandas as pd

path = 'file.xls'
with open(path,'rb') as file:
    ole = OleFileIO_PL.OleFileIO(file)
    if ole.exists('Workbook'):
        d = ole.openstream('Workbook')
        x=pd.read_excel(d,engine='xlrd')
        print(x.head())

Upvotes: 15

Shatalov Vadim
Shatalov Vadim

Reputation: 109

+1 to Ramiel. Just comment out these lines in compdoc.py (lines 425-27 in xlrd 1.2.0):

if self.seen[s]:
    print("_locate_stream(%s): seen" % qname, file=self.logfile);dump_list(self.seen, 20, self.logfile)
    raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))

Upvotes: 10

glmvrml
glmvrml

Reputation: 1632

I got same error with one of my .xls files (excel can open them just fine). Problem is located in xlrd compdoc.py. As I guess Compdoc.seen array keeps track of already read "FAT" sectors. In my case Root Entry reading block (SSCS) gets all that sectors marked as seen, leading to exception raise in future. U can try to find the bug in sectors reading logic and contribute to xlrd :) or just comment this lines with exception raise which will likely solve problem in your case (As did in mine) and wait for xlrd update.

Upvotes: 2

Tooblippe
Tooblippe

Reputation: 3721

Maybe as a last resort you can try and save the .xls file as a .csv file and then try and read it.

Obviously you say you can open it after an open and close from Excel so it's the same effort.

If you realy want your script to open them, then if you are on the windows platform use the pywin32 to open and close Excel from your script, and open and close the file in the same go. Maybe that could work. (silly but could be a work around)

Example (stolen from here)

from win32com.client import Dispatch
xl = Dispatch('Excel.Application')
wb = xl.Workbooks.Open('C:\\Documents and Settings\\GradeBook.xls')

Upvotes: 0

Related Questions