miroslaavi
miroslaavi

Reputation: 553

Polars read_excel issue with a date format

I'm trying to read rather messy excel file(s) to Polars dataframe, but getting a "XlsxValueError: Error: potential invalid date format.".

I believe that the issue is related to some date column values being in excel numerical date format that raises the error. Is there a way for me to put a setting to Polars or Xlsx2csv options that I want to read the columns as string types rather than try to convert to dates. I tried setting infer_schema_length to 0 to read the columns as strings, but looks like it is already the xlsx2csv writer raising the error.

My code currently as below:

pl.read_excel(file="file_path",
         sheet_name="sheet_name",
        read_csv_options={"infer_schema_length":0},
         xlsx2csv_options={"skip_hidden_rows":False})

Out:

XlsxValueError: Error: potential invalid date format.

When reading the data in with Pandas read_excel it does not raise errors. When trying to convert Pandas df to polars it raises an error:

df = pd.read_excel("file_name",
         sheet_name="sheet_name")
pl.from_pandas(df)

Out:

ArrowTypeError: Expected bytes, got a 'int' object

Current workaround (not ideal) I have is to read the data with Pandas in string format, and then convert to Polars dataframe and start cleaning the data.

Upvotes: 5

Views: 3765

Answers (1)

miroslaavi
miroslaavi

Reputation: 553

Credits to @jqurious as he found that you can pass format_type argument to the xlsx2csv_options which reads the date columns as strings.

Below the working code to read the data and avoid the error

pl.read_excel(source="file_path",
         sheet_name="sheet_name",
         xlsx2csv_options={"ignore_formats":["date"]})

EDIT 17/04/22:

Also, I've had excel files that are raising the same error despite I use the "ignore_formats":["date"]. Then the workaround is to ignore_format for "float" and read everything as strings in Xlsx2csv writer, as below:

pl.read_excel(
    source=r"path\file.xlsx",
    sheet_name="sheet_name",
    xlsx2csv_options={"skip_hidden_rows": False, "ignore_formats": ["float"]},
    read_csv_options={"infer_schema_length": 0},
)

Upvotes: 6

Related Questions