jayaprakash k
jayaprakash k

Reputation: 21

Polars ignoring rows that are empty in Excel

Closed Polars ignoring rows that are empty in Excel #18250 Labels bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars Description @Rashik-raj Rashik-raj opened on Aug 18, 2024 · edited by Rashik-raj Checks

I have checked that this issue has not already been reported.

I have confirmed this bug exists on the latest version of Polars. Reproducible example import polars as pl import pandas as pd

def read_excel_using_polars(path): return pl.read_excel(source=path)

def read_csv_using_polars(path): return pl.read_csv(source=path)

def read_excel_using_pandas(path): return pd.read_excel(io=path)

def read_csv_using_pandas(path): return pd.read_csv(path)

pl_df_excel = read_excel_using_polars("test_empty_rows.xlsx") pd_df_excel =read_excel_using_pandas("test_empty_rows.xlsx")

pl_df_csv = read_csv_using_polars("test_empty_rows.csv") pd_df_csv = read_csv_using_pandas("test_empty_rows.csv")

print("Reading excel from polars", end="\n\n") print(pl_df_excel)

print("\nReading excel from pandas", end="\n\n") print(pd_df_excel)

print("\nReading csv from polars", end="\n\n") print(pl_df_csv)

print("\nReading csv from pandas", end="\n\n") print(pd_df_csv) test_empty_rows.xlsx test_empty_rows.csv

Log output Reading excel from polars

shape: (8, 4) ┌──────┬──────┬──────┬─────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ null ┆ i64 │ ╞══════╪══════╪══════╪═════╡ │ 1 ┆ 1 ┆ null ┆ 1 │ │ 2 ┆ null ┆ null ┆ 2 │ │ 3 ┆ 3 ┆ null ┆ 3 │ │ null ┆ 4 ┆ null ┆ 4 │ │ 5 ┆ 5 ┆ null ┆ 5 │ │ 6 ┆ 6 ┆ null ┆ 6 │ │ 9 ┆ 9 ┆ null ┆ 9 │ │ 10 ┆ 10 ┆ null ┆ 0 │ └──────┴──────┴──────┴─────┘

Reading excel from pandas

  a     b   c    d

0 1.0 1.0 NaN 1.0 1 2.0 NaN NaN 2.0 2 3.0 3.0 NaN 3.0 3 NaN 4.0 NaN 4.0 4 5.0 5.0 NaN 5.0 5 6.0 6.0 NaN 6.0 6 NaN NaN NaN NaN 7 NaN NaN NaN NaN 8 9.0 9.0 NaN 9.0 9 10.0 10.0 NaN 0.0

Reading csv from polars

shape: (10, 4) ┌──────┬──────┬──────┬──────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str ┆ i64 │ ╞══════╪══════╪══════╪══════╡ │ 1 ┆ 1 ┆ null ┆ 1 │ │ 2 ┆ null ┆ null ┆ 2 │ │ 3 ┆ 3 ┆ null ┆ 3 │ │ null ┆ 4 ┆ null ┆ 4 │ │ 5 ┆ 5 ┆ null ┆ 5 │ │ 6 ┆ 6 ┆ null ┆ 6 │ │ null ┆ null ┆ null ┆ null │ │ null ┆ null ┆ null ┆ null │ │ 9 ┆ 9 ┆ null ┆ 9 │ │ 10 ┆ 10 ┆ null ┆ 0 │ └──────┴──────┴──────┴──────┘

Reading csv from pandas

  a     b   c    d

0 1.0 1.0 NaN 1.0 1 2.0 NaN NaN 2.0 2 3.0 3.0 NaN 3.0 3 NaN 4.0 NaN 4.0 4 5.0 5.0 NaN 5.0 5 6.0 6.0 NaN 6.0 6 NaN NaN NaN NaN 7 NaN NaN NaN NaN 8 9.0 9.0 NaN 9.0 9 10.0 10.0 NaN 0.0 Issue description We were using pandas to import files in our FastApi server that imports Excel, CSV, or XPT files. Due to lack of performance, we switched to Polars (had issue running in uvicorn as it is thread based server, made a workaround). We noticed that Polars is dropping rows that are entirely empty for Excel files.

To double check, I tested with pandas as well, and ensured that pandas do not drop empty rows. I went through polars documentation, but with no luck. performed some debugging in polars package and found _drop_null_data to be the culprit. I had monkeypatched _drop_null_data to return entire dataframe in an attempt to not loose empty rows as a workaround.

import polars as pl

def _custom_drop_null_data(df, raise_if_empty): return df

pl.io.spreadsheet.functions._drop_null_data = _custom_drop_null_data df = pl.read_excel(source="test_empty_rows.xlsx") # we get empty rows as well Also polars seems to be dropping empty rows only for Excel files which you can see from the reproducible sample. So to keep backward compatibility just in case it is desired feature, I am proposing to use drop_empty_rows parameter in _drop_null_data that defaults to True, and can be switched when reading Excel files as pl.read_excel(source=path, drop_empty_rows=False). I have also verified that this happens with every engine of Excel.

This is how patched _drop_null_data will look like:

def _drop_null_data(df: pl.DataFrame, *, raise_if_empty: bool, drop_empty_rows: bool = True) -> pl.DataFrame: """ If DataFrame contains columns/rows that contain only nulls, drop them.

If `drop_empty_rows` is set to `False`, empty rows are not dropped.
"""
null_cols = []
for col_name in df.columns:
    # note that if multiple unnamed columns are found then all but the first one
    # will be named as "_duplicated_{n}" (or "__UNNAMED__{n}" from calamine)
    if col_name == "" or re.match(r"(_duplicated_|__UNNAMED__)\d+$", col_name):
        col = df[col_name]
        if (
            col.dtype == Null
            or col.null_count() == len(df)
            or (
                col.dtype in NUMERIC_DTYPES
                and col.replace(0, None).null_count() == len(df)
            )
        ):
            null_cols.append(col_name)
if null_cols:
    df = df.drop(*null_cols)

if len(df) == 0 and len(df.columns) == 0:
    return _empty_frame(raise_if_empty)
if drop_empty_rows:
    return df.filter(~F.all_horizontal(F.all().is_null()))
return df

Expected behavior To not drop empty rows conditionally based on new flag drop_empty_rows.

Upvotes: 0

Views: 51

Answers (0)

Related Questions