Python code to change all xlsx files in directory to CSV corrupting original file

Question

I am writing a python code to change all files in a directory to csv then fill in missing time stamps with a row with NaN values and save to a new file. The code is working however it is corrupting my original file for some reason.

Excel error when trying to open original file

"excel cannot open the file because file format or extention is not valid. Verify that the file has not been corrupted and the extension matches the format of the file"

Here is the code I am using

import os, re
import pandas as pd
import numpy as np

inputdirectory = input('Enter the directory: ')
directory = os.listdir(inputdirectory)
os.chdir(inputdirectory)

for file in directory:
    data_xls = pd.read_excel(file, 'Sheet2', index_col=None)
    data_xls.to_csv(file, encoding='utf-8', index=False)
    df = pd.read_csv(file, index_col="DateTime", parse_dates=True)
    df = df.resample('1min').mean()
    df = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="1min"))
    df.to_csv(os.path.basename(file) + "-processed.csv", index=True, index_label="DateTime", na_rep='NaN')

Jean-Fran&#231;ois Fabre · Accepted Answer

data_xls = pd.read_excel(file, 'Sheet2', index_col=None)
data_xls.to_csv(file, encoding='utf-8', index=False)

means that you're dumping csv data using your xlsx filename.

Excel doesn't recognize it's csv because the extension is xlsx and the file appears as corrupt (and you actually lose data when doing that: renaming back to .csv allows to read the data, but you lose formatting, other sheets if any...). But pandas happily reads back your csv data even if it's called .xlsx, so you think your code works fine.

Solution: Use another name for that temporary file.

My humble non-expert fix:

import glob,os

inputdirectory = input('Enter the directory: ')

for xls_file in glob.glob(os.path.join(inputdirectory,"*.xls*")):
    data_xls = pd.read_excel(xls_file, 'Sheet2', index_col=None)
    csv_file = os.path.splitext(xls_file)[0]+".csv"
    data_xls.to_csv(csv_file, encoding='utf-8', index=False)

As a bonus, the script doesn't use os.chdir and processes only xls files. It uses the same radix name for .csv temp files, but doesn't overwrite the .xls files.

Aside: I'm not a pandas expert, but I'm sure this temporary file can be avoided somehow.

Python code to change all xlsx files in directory to CSV corrupting original file

Answers (1)

Related Questions