Yotam

Reputation: 10485

Python: parsing .xls file failed with both xlrd and pandas

I'm trying to parse an .xls file. I tried:

# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np
import sys

print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__

# Parse a specific sheet
df = pd.read_excel('NextDebitCreditCard.xls', 0, index_col='StatusDate')
df.dtypes

But I keep getting

  File "/usr/lib/python2.7/dist-packages/xlrd/book.py", line 1252, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<html la'

I got the same error with xlrd. I'm not sure if it's a regular xls file, so I'm adding the beginning and the end of the file here:

<html lang="he">
<head>
<META CONTENT="text/html" HTTP-EQUIV="Content-Type" charset="iso-8859-8"></META><META CONTENT="no-cache" HTTP-EQUIV="Pragma"></META><META CONTENT="0" HTTP-EQUIV="expires"></META><title>
<TEXT>
                some text here
.....
.....
.....
.....
 &#8362; 942.56</td></tr></table>
        </div>
        </div></td><td class="homeMessagesTd" id="leftSide">                
                            </td></tr></table></form></body></html>

Any ideas? Thanks!

Upvotes: 1

Views: 2241

Answers (2)

Deep Panjwani

Reputation: 11

I finally found a solution for this. A file downloaded from the internet may carry the .xls extension without actually being in the binary Excel format, even though Excel can open it. pd.read_html may not work either if it does not find any tables in the file.

Solution: try reading it as tab-separated text:

pd.read_csv('filename.xls', sep ='\t')
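
Before picking a reader, it can help to check what the file really contains. The following is a minimal sketch, not part of the original answer: `sniff_format` is a hypothetical helper that looks only at the leading bytes, and the set of HTML prefixes it recognizes is an assumption.

```python
# Hypothetical helper: guess what a ".xls" file really contains from its first bytes.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy binary .xls (OLE2 container)
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx files are zip archives


def sniff_format(path):
    """Return 'xls', 'xlsx', 'html', or 'text' based on the file's leading bytes."""
    with open(path, "rb") as fh:
        head = fh.read(512)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Skip a UTF-8 byte-order mark and leading whitespace before looking for markup.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    # Assumption: these prefixes are enough to recognize an HTML export.
    if stripped.startswith((b"<html", b"<!doctype html", b"<table")):
        return "html"
    return "text"
```

A result of 'text' suggests trying pd.read_csv with a separator such as '\t', 'html' suggests pd.read_html, and 'xls' suggests pd.read_excel. For the file in the question, which starts with `<html lang="he">`, this sketch would report 'html'.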

Upvotes: 1

JD Long

Reputation: 60756

From the comments I can see you realize this is not a 'real' Excel file but an HTML file saved with the .xls extension. Since you haven't provided the full file, we can only guess what may or may not work.

I'd start with the HTML parsing tools in Pandas:

http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-html

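You could try:

dfs = pd.read_html('NextDebitCreditCard.xls')  # returns a list of DataFrames, one per <table> found
df = dfs[0]  # assuming the first table is the one you want; adjust the index otherwise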

If that does not get you close, it may be time to get into BeautifulSoup.
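
As a rough idea of what manual parsing could look like, here is a minimal sketch that pulls table cells out by hand. It uses the standard library's html.parser instead of BeautifulSoup so that it needs no extra dependency. The names `TableRows` and `extract_rows` are hypothetical, the default encoding is taken from the charset declared in the question's excerpt, and the sketch does not handle nested tables, which the excerpt suggests the real file may contain.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that had no cells
                self.rows.append(self._row)
            self._row = None


def extract_rows(path, encoding="iso-8859-8"):
    """Parse an HTML file and return its table rows as lists of strings."""
    # errors="replace" keeps going if a byte is undefined in the assumed charset.
    with open(path, encoding=encoding, errors="replace") as fh:
        parser = TableRows()
        parser.feed(fh.read())
        parser.close()
    return parser.rows
```

The resulting list of rows could then be passed to pd.DataFrame, for example pd.DataFrame(rows[1:], columns=rows[0]) if the first row holds the headers, which is a guess about the file's layout.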

Upvotes: 1
