Jeril
Jeril

Reputation: 8521

python: converting corrupt xls file

I have downloaded few sales dataset from a SAP application. SAP has automatically converted the data to .XLS file. Whenever I open it using Pandas library I am getting the following error:

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '\xff\xfe\r\x00\n\x00\r\x00'

When I opened the .XLS file using MSEXCEL it is shows a popup saying that the file is corrupt or unsupported extension do you want to continue when I clicked 'Yes' its showing the correct data. When I saved the file again as .xls using msexcel I am able to use it using Pandas.

So, I tried renaming the file using os.rename() but it dint work. I tried opening the file and removing \xff\xfe\r\x00\n\x00\r\x00, but then also it dint work.

The solution is to open MSEXCEL and save the file again as .xls manually, is there any way to automate this. Kindly help.

Upvotes: 2

Views: 11870

Answers (3)

Yaron
Yaron

Reputation: 1852

Following @Jeril's answer, we can skip the saving file process and load it directly to pandas.

import pandas as pd
import io
    

import pandas as pd
import io

file_path = 'data/some_file.xls'

# Load the file with utf-16 encoding
with io.open(file_path, "r", encoding="utf-16") as file1:
    data = file1.readlines()

# Split values by tab and create a DataFrame
data_list = [row.replace('\n', '').split('\t') for row in data]

# Extract column names (first list in the data_list)
column_names = data_list[0]

# Create a DataFrame using the remaining rows and column names
df = pd.DataFrame(data_list[1:], columns=column_names)
display(df)

Upvotes: 0

Tomasz Łukaszczyk
Tomasz Łukaszczyk

Reputation: 21

The other way to solve this problem is using win32com.client library:

import win32com.client
import os

o = win32com.client.Dispatch("Excel.Application")
o.Visible = False

filename = os.getcwd() + '/' + 'SALEJAN17.xls'
output = os.getcwd() + '/' + 'myexcel.xlsx'

wb = o.Workbooks.Open(filename)
wb.ActiveSheet.SaveAs(output,51)

In my example you save to .xlsx format but you can save as .xls as well.

Upvotes: 2

Jeril
Jeril

Reputation: 8521

Finally I converted the corrupt .xls to a correct .xls file. The following is the code:

# Changing the data types of all strings in the module at once
from __future__ import unicode_literals
# Used to save the file as excel workbook
# Need to install this library
from xlwt import Workbook
# Used to open to corrupt excel file
import io

filename = r'SALEJAN17.xls'
# Opening the file using 'utf-16' encoding
file1 = io.open(filename, "r", encoding="utf-16")
data = file1.readlines()

# Creating a workbook object
xldoc = Workbook()
# Adding a sheet to the workbook object
sheet = xldoc.add_sheet("Sheet1", cell_overwrite_ok=True)
# Iterating and saving the data to sheet
for i, row in enumerate(data):
    # Two things are done here
    # Removeing the '\n' which comes while reading the file using io.open
    # Getting the values after splitting using '\t'
    for j, val in enumerate(row.replace('\n', '').split('\t')):
        sheet.write(i, j, val)

# Saving the file as an excel file
xldoc.save('myexcel.xls')

import pandas as pd
df = pd.ExcelFile('myexcel.xls').parse('Sheet1')

No errors.

Upvotes: 5

Related Questions