Reputation: 3105
I'm having problems reading .xls files written by a Perl script which I have no control over. The files contain some formatting and line breaks within cells.
filename = '/home/shared/testfile.xls'
book = xlrd.open_workbook(filename)
sheet = book.sheet_by_index(0)
for rowIndex in xrange(1, sheet.nrows):
row = sheet.row(rowIndex)
This is throwing the following error:
_locate_stream(Workbook): seen
0 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
20 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172480= 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
172500 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 3 2
172520 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173840= 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
173860 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1
173880 1 1 1 1 1 1 1 1
Traceback (most recent call last):
File "/home/shared/xlrdtest.py", line 5, in <module>
book = xlrd.open_workbook(filename)
File "/usr/local/lib/python2.7/site-packages/xlrd/__init__.py", line 443, in open_workbook
ragged_rows=ragged_rows,
File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 84, in open_workbook_xls
ragged_rows=ragged_rows,
File "/usr/local/lib/python2.7/site-packages/xlrd/book.py", line 616, in biff2_8_load
self.mem, self.base, self.stream_len = cd.locate_named_stream(qname)
File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 393, in locate_named_stream
d.tot_size, qname, d.DID+6)
File "/usr/local/lib/python2.7/site-packages/xlrd/compdoc.py", line 421, in _locate_stream
raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
xlrd.compdoc.CompDocError: Workbook corruption: seen[2] == 4
I'm not able to find any info about CompDocError or Workbook corruption, even less the seen[2] == 4 part.
Upvotes: 19
Views: 25317
Reputation: 6569
[For pandas users who have trouble with Workbook corruption: seen[2] == 4
] when reading XLS files
NOTE: xlrd no longer supports anything other than xls
files, see this answer for details and alternatives.
Use the xlrd
2.0.0 or above: https://pypi.org/project/xlrd/.
Add your own XLS engine for pandas in one of your __init__.py
files which loads automatically
from pandas.io.excel._xlrd import XlrdReader
class CustomXlrdReader(XlrdReader):
def load_workbook(self, filepath_or_buffer):
"""Same as original, just uses ignore_workbook_corruption=True)"""
from xlrd import open_workbook
if hasattr(filepath_or_buffer, "read"):
data = filepath_or_buffer.read()
return open_workbook(file_contents=data, ignore_workbook_corruption=True)
else:
return open_workbook(filepath_or_buffer)
ExcelFile._engines['custom_xlrd'] = CustomXlrdReader
print('Monkey patching pandas XLS engines. See CustomXlrdReader')
To use it, specify engine='custom_xlrd'
when you read XLS file
df = pd.read_excel(filepath, engine='custom_xlrd')
Here's the corrupted file that xlrd
is tested against:
https://github.com/python-excel/xlrd/blob/master/tests/samples/corrupted_error.xls
Upvotes: 7
Reputation: 301
Inspired by this answer, but a bit more straigthforward:
pd.read_excel accepts xlrd.Book objects so you could do:
workbook = xlrd.open_workbook('file_name.xls', ignore_workbook_corruption=True)
excel = pd.read_excel(workbook)
Upvotes: 13
Reputation: 1
testfile.xls -----> Save As ---> Format 97-2003 --->testfile2.xlc
filename = '/home/shared/testfile2.xls'
OK
Upvotes: -1
Reputation: 151
From pkm comment (link) the problem is with a Compound File Binary
#pip install OleFileIO-PL
import OleFileIO_PL
import pandas as pd
path = 'file.xls'
with open(path,'rb') as file:
ole = OleFileIO_PL.OleFileIO(file)
if ole.exists('Workbook'):
d = ole.openstream('Workbook')
x=pd.read_excel(d,engine='xlrd')
print(x.head())
Upvotes: 15
Reputation: 109
+1 to Ramiel.
Just comment out these lines in compdoc.py
(lines 425-27
in xlrd 1.2.0
):
if self.seen[s]:
print("_locate_stream(%s): seen" % qname, file=self.logfile);dump_list(self.seen, 20, self.logfile)
raise CompDocError("%s corruption: seen[%d] == %d" % (qname, s, self.seen[s]))
Upvotes: 10
Reputation: 1632
I got same error with one of my .xls files (excel can open them just fine). Problem is located in xlrd compdoc.py
. As I guess Compdoc.seen
array keeps track of already read "FAT" sectors. In my case Root Entry reading block (SSCS) gets all that sectors marked as seen, leading to exception raise in future. U can try to find the bug in sectors reading logic and contribute to xlrd :) or just comment this lines with exception raise which will likely solve problem in your case (As did in mine) and wait for xlrd update.
Upvotes: 2
Reputation: 3721
Maybe as a last resort you can try and save the .xls file as a .csv file and then try and read it.
Obviously you say you can open it after an open and close from Excel so it's the same effort.
If you realy want your script to open them, then if you are on the windows platform use the pywin32 to open and close Excel from your script, and open and close the file in the same go. Maybe that could work. (silly but could be a work around)
Example (stolen from here)
from win32com.client import Dispatch
xl = Dispatch('Excel.Application')
wb = xl.Workbooks.Open('C:\\Documents and Settings\\GradeBook.xls')
Upvotes: 0