skytwosea

Reputation: 369

Can Polars with calamine engine be coerced into failing more gracefully?

I have tens of thousands of Excel files that I'm validating with Polars. Some of these files have a problem that triggers an index-out-of-bounds panic in the pyo3 runtime when using engine="calamine". The issue does not occur with engine="xlsx2csv". The Excel problem is known and trivial, but because of my company's workflow pipeline I have little control over its occasional recurrence, so I want to handle this panic more gracefully.

The minimal working example really is minimal; it just calls read_excel:

from pathlib import Path
import polars as pl

root = Path("/path/to/globdir")

def try_to_open():
    for file in root.rglob("*/*_fileID.xlsx"):
        print(f"\r{file.name}", end='')
        try:
            df = pl.read_excel(file, engine="calamine", infer_schema_length=0)
        except Exception as e:
            print(f"{file.name}: {e}", flush=True)

def main():
    try_to_open()

if __name__ == "__main__":
    main()

When a 'contaminated' excel file is processed, it fails like so:

thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/calamine-0.26.1/src/xlsx/cells_reader.rs:347:39:
index out of bounds: the len is 2585 but the index is 2585
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/path/to/script.py", line 18, in <module>
    main()
  File "/path/to/script.py", line 15, in main
    try_to_open()
  File "/path/to/script.py", line 10, in try_to_open
    df = pl.read_excel(file, engine="calamine", infer_schema_length=0)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.12/site-packages/polars/io/spreadsheet/functions.py", line 299, in read_excel
    return _read_spreadsheet(
           ^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.12/site-packages/polars/io/spreadsheet/functions.py", line 536, in _read_spreadsheet
    name: reader_fn(
          ^^^^^^^^^^
  File "/path/to/venv/lib/python3.12/site-packages/polars/io/spreadsheet/functions.py", line 951, in _read_spreadsheet_calamine
    ws_arrow = parser.load_sheet_eager(sheet_name, **read_options)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/venv/lib/python3.12/site-packages/fastexcel/__init__.py", line 394, in load_sheet_eager
    return self._reader.load_sheet(
           ^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: index out of bounds: the len is 2585 but the index is 2585

As you can see, the try:except block in the Python script does not catch the PanicException.

I want to be able to capture the name of the file that has failed. Is there a way to coerce Calamine's child threads to collapse and return a fail code, instead of crashing everything?

Upvotes: 1

Views: 71

Answers (1)

skytwosea

Reputation: 369

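pyo3_runtime.PanicException inherits from Python's BaseException rather than from Exception, which is why `except Exception` does not catch it. Using BaseException in the try:except block does catch the panic. It's quite a broad sweep, since it also catches KeyboardInterrupt and SystemExit, but it will do in a pinch.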

See another SO q&a here

See the pyo3 docs here
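
A minimal sketch of the BaseException approach, assuming the goal is just to record which files failed and carry on. The helper name `collect_failures` and its `reader` parameter are hypothetical, introduced here for illustration; they are not part of Polars or fastexcel. KeyboardInterrupt and SystemExit are re-raised so that Ctrl-C still stops the run.

```python
from pathlib import Path
from typing import Callable, Iterable


def collect_failures(
    files: Iterable[Path],
    reader: Callable[[Path], object],
) -> list[tuple[Path, str]]:
    """Call reader(file) on each file; return (file, message) for each failure.

    Hypothetical helper: `reader` stands in for the pl.read_excel call.
    """
    failed = []
    for file in files:
        try:
            reader(file)
        except (KeyboardInterrupt, SystemExit):
            # Let genuine shutdown requests through.
            raise
        except BaseException as e:
            # PanicException derives from BaseException, so it lands here.
            failed.append((file, f"{type(e).__name__}: {e}"))
    return failed
```

With the script from the question, this could be called as `collect_failures(root.rglob("*/*_fileID.xlsx"), lambda f: pl.read_excel(f, engine="calamine", infer_schema_length=0))`, after which the returned list holds the names of the files that panicked.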

Upvotes: 1
