Reputation: 10485
I'm trying to parse an .xls file. I tried:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy.random as np
import sys
print 'Python version ' + sys.version
print 'Pandas version: ' + pd.__version__
# Parse a specific sheet
df = pd.read_excel('NextDebitCreditCard.xls', 0, index_col='StatusDate')
df.dtypes
But I keep getting
File "/usr/lib/python2.7/dist-packages/xlrd/book.py", line 1252, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<html la'
I got the same error with xlrd. I'm not sure if it's a regular xls file, so I'm adding the beginning and the end of the file here:
<html lang="he">
<head>
<META CONTENT="text/html" HTTP-EQUIV="Content-Type" charset="iso-8859-8"></META><META CONTENT="no-cache" HTTP-EQUIV="Pragma"></META><META CONTENT="0" HTTP-EQUIV="expires"></META><title>
<TEXT>
some text here
.....
.....
.....
.....
₪ 942.56</td></tr></table>
</div>
</div></td><td class="homeMessagesTd" id="leftSide">
</td></tr></table></form></body></html>
Any ideas? Thanks!
Upvotes: 1
Views: 2241
Reputation: 11
I finally found a solution for this. A file downloaded from the internet may carry an .xls extension without actually being in the xls format, even though Excel can open it. pd.read_html may not work on it either, because it may not find any tables.
Solution: try reading it as tab-separated text:
pd.read_csv('filename.xls', sep='\t')
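To decide which reader to try, it can help to look at the first bytes of the file. The following is only a minimal sketch: the helper name sniff_format and the set of prefixes it checks are my own assumptions, not part of pandas or xlrd, and the tab check is a rough heuristic.

```python
# Hypothetical helper (not a pandas/xlrd API): guess what a ".xls" file really is.
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # signature of binary .xls files
ZIP_MAGIC = b'PK\x03\x04'                          # signature of zip-based .xlsx files


def sniff_format(head):
    """Classify a file from its first bytes, e.g. the result of f.read(512)."""
    if head.startswith(OLE2_MAGIC):
        return 'xls'
    if head.startswith(ZIP_MAGIC):
        return 'xlsx'
    # HTML exports usually start with one of these tags (assumed list).
    stripped = head.lstrip().lower()
    if stripped.startswith((b'<html', b'<!doctype', b'<table')):
        return 'html'
    # Rough heuristic: plain text containing tabs is probably tab-separated.
    if b'\t' in head:
        return 'tsv'
    return 'unknown'


# The file in the question starts like this:
print(sniff_format(b'<html lang="he">\n<head>'))  # prints: html
```

For a real file, pass the first few hundred bytes read in binary mode, for example open('filename.xls', 'rb').read(512). A result of 'html' suggests pd.read_html, 'tsv' suggests pd.read_csv with sep='\t', and 'xls' suggests pd.read_excel.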
Upvotes: 1
Reputation: 60756
From the comments I can see you realize this is not a 'real' Excel file but an HTML file saved with the .xls extension. Since you haven't provided the full file, we can only guess what may or may not work.
I'd start with the HTML parsing tools in Pandas:
http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-html
You could try the following (read_html returns a list of DataFrames, one per table it finds):
dfs = pd.read_html('NextDebitCreditCard.xls')
df = dfs[0]
If that does not get you close, it may be time to get into BeautifulSoup.
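If you do end up parsing by hand, here is a rough sketch of the idea. To keep it dependency-free it uses the standard library's html.parser instead of BeautifulSoup, so treat it as a stand-in. The class name TableRows and the sample markup are made up for illustration, and it assumes flat tables: with nested tables, like the ones your file appears to contain, only the innermost rows come out cleanly.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> both match.
        if tag == 'tr':
            self._row = []
            self._cell = None
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append(''.join(self._cell).strip())
        self._cell = None

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self._close_cell()
        elif tag == 'tr':
            self._close_cell()
            if self._row is not None:
                self.rows.append(self._row)
                self._row = None


# Made-up sample markup, not taken from the real file.
sample = ('<table><tr><th>Date</th><th>Amount</th></tr>'
          '<tr><td>01/02</td><td>942.56</td></tr></table>')
parser = TableRows()
parser.feed(sample)
print(parser.rows)  # prints: [['Date', 'Amount'], ['01/02', '942.56']]
```

For the real file you would read the text with the encoding named in its header, something like open('NextDebitCreditCard.xls', encoding='iso-8859-8').read(), feed it to the parser, and then build a DataFrame from parser.rows.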
Upvotes: 1