Reputation: 141
I'm trying to download and manipulate an xls file using urllib and xlrd.
The data is coming from url http://profiles.doe.mass.edu/search/search_export.aspx?orgCode=&orgType=5,12&runOrgSearch=Y&searchType=ORG&leftNavId=11238&showEmail=N
I'm using Python 2.7, xlrd 0.9.4, urllib 1.17, and I'm on a Mac.
I'm able to successfully download the file using this code.
saveLocation = home_dir+"/test/"
fileName = "data.xls"
page = <the url given above>
urllib.urlretrieve(page, saveLocation+fileName)
I then try to open the file using xlrd
wb = xlrd.open_workbook(saveLocation+fileName)
But get the error
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\r\n\r\n<htm'
This tells me that the file is not downloading as a true xls file. I can open the file in Excel and get no popup warnings or compatibility errors. Oddly enough, if I then save the file (in Excel) as Excel 97-2004, the xlrd error goes away. So it appears that Excel "fixes" whatever was wrong with the file.
So my question is, how do I "fix" the file in python or download the data in an appropriate format that xlrd will recognize?
I've also tried downloading the file as an xlsx file and using openpyxl but get a similar error. openpyxl says its not a valid zip file. I've also tried downloading the data using different methods such as requests.
Thanks.
EDIT: Using the information provided by @DSM, I was able to download and use the Excel file. Here's the code I used.
dfs = pd.read_html(fileLocation+fileName, index_col = 7, header=0)[0]
writer = pd.ExcelWriter(fileLocation+fileName)
dfs.to_excel(writer,"Sheet1")
writer.save()
I was then able to access the file as a true Excel file
ws = pd.read_excel(fileLocation+fileName, 0)
Upvotes: 2
Views: 1809
Reputation: 353099
As the <htm
bit should hint, this is really the data presented in xml fashion, despite the name .xls
. (It's almost always worth manually looking at the data header in your favourite editor to check to see what something actually is when it turns out to be hard to read.) Sometimes this can be a real nuisance to deal with, but fortunately here we can read it simply by using read_html
:
>>> url="http://profiles.doe.mass.edu/search/search_export.aspx?orgCode=&orgType=5,12&runOrgSearch=Y&searchType=ORG&leftNavId=11238&showEmail=N"
>>> dfs = pd.read_html(url)
>>> len(dfs)
1
>>> dfs[0].iloc[:5,:5]
0 1 \
0 Org Name Org Code
1 Abby Kelley Foster Charter Public (District) 04450000
2 Abington 00010000
3 Academy Of the Pacific Rim Charter Public (Dis... 04120000
4 Acton (non-op) 00020000
2 3 4
0 Org Type Function Contact Name
1 Charter District Charter School Leader Brian Haas
2 Public School District Superintendent Peter Schafer
3 Charter District Charter School Leader Chris Collins
4 Public School District Superintendent Glenn Brand
Looking more closely, we see that we can use the 0th row for the headers, and so:
>>> df = pd.read_html(url, header=0)[0]
>>> df.iloc[:5, :5]
Org Name Org Code \
0 Abby Kelley Foster Charter Public (District) 4450000
1 Abington 10000
2 Academy Of the Pacific Rim Charter Public (Dis... 4120000
3 Acton (non-op) 20000
4 Acton-Boxborough 6000000
Org Type Function Contact Name
0 Charter District Charter School Leader Brian Haas
1 Public School District Superintendent Peter Schafer
2 Charter District Charter School Leader Chris Collins
3 Public School District Superintendent Glenn Brand
4 Public School District Superintendent Glenn Brand
Upvotes: 2