How can I remove nulls in the process of unpivoting a polars dataframe?

Question

I have a large polars dataframe that I need to unpivot. This dataframe contains lots of null values (at least half). I want to drop the nulls while unpivoting the dataframe. I already tried to unpivot the dataframe first and then filter it with drop_nulls() or similar approaches. However, this is too memory-intensive (on a machine with about 1TB RAM).

Is there any way in which I can filter the dataset already during the process of unpivot?

Any help is appreciated!

Sample data.

# in reality, this dataset has about 160k rows and columns 
# (square matrix), and is about 100GB
df = {
    "A": [None, 2, 3],
    "B": [None, None, 2],
    "C": [None, None, None], 
    "names": ["A", "B", "C"]
}
df = pl.DataFrame(df)

df.unpivot(index = "names", variable_name = "names_2", value_name = "distance")

Output.

shape: (9, 3)
┌───────┬─────────┬──────────┐
│ names ┆ names_2 ┆ distance │
│ ---   ┆ ---     ┆ ---      │
│ str   ┆ str     ┆ i64      │
╞═══════╪═════════╪══════════╡
│ A     ┆ A       ┆ null     │
│ B     ┆ A       ┆ null     │
│ C     ┆ A       ┆ null     │
│ A     ┆ B       ┆ 2        │
│ B     ┆ B       ┆ null     │
│ C     ┆ B       ┆ null     │
│ A     ┆ C       ┆ 3        │
│ B     ┆ C       ┆ 2        │
│ C     ┆ C       ┆ null     │
└───────┴─────────┴──────────┘

This could then be filtered (e.g. using df = df.drop_nulls()), but I would like to get this desired result directly from the unpivot.

Expected output.

shape: (3, 3)
┌───────┬─────────┬──────────┐
│ names ┆ names_2 ┆ distance │
│ ---   ┆ ---     ┆ ---      │
│ str   ┆ str     ┆ i64      │
╞═══════╪═════════╪══════════╡
│ A     ┆ B       ┆ 2        │
│ A     ┆ C       ┆ 3        │
│ B     ┆ C       ┆ 2        │
└───────┴─────────┴──────────┘

Hericks · Accepted Answer

Very likely you can make the operation more efficient in runtime and memory consumption by using pl.LazyFrames and polars' streaming engine.

By using pl.LazyFrames, the melt / unpivot and filter / drop_nulls won't be operations eagerly, but first aggregated into a query plan. When collecting the lazy DataFrame (i.e. materialising it into a pl.DataFrame), the query plan can be optimised, taking into account subsequent operations.

Streaming will enable the processing to not done all-at-once, but executed in batches, ensuring that the processed batches don't grow larger-than-memory.

(
    df
    # convert to pl.LazyFrame
    .lazy()
    # create query plan
    .unpivot(
        index="names",
        variable_name="names_2",
        value_name="distance"
    )
    .drop_nulls()
    # collect pl.LazyFrame while using streaming engine
    .collect(streaming=True)
)

Note. Tentative tests on my machine give large improvements in runtime and memory consumptions.

How can I remove nulls in the process of unpivoting a polars dataframe?

Answers (2)

Related Questions