TomNorway
TomNorway

Reputation: 3182

Overwriting inf in many columns

I have a dataframe with many columns that have occurances of inf. I'd like to replace these with null. All of the column names in question start with the string "ratio_".

This is what I've tried, but I get new columns with the title "literal", when I would like to replace the old ones:

import polars as pl
import numpy as np
df = pl.DataFrame({"label":["a", "b", "c"], "ratio_a":[1.,2.,np.inf]})

df.with_columns(
    pl.when(pl.col("^ratio_\w+$").is_infinite())
    .then(None)
    .otherwise(pl.col("^ratio_\w+$")
    )
)

shape: (3, 3)
┌───────┬─────────┬─────────┐
│ label ┆ ratio_a ┆ literal │
│ ---   ┆ ---     ┆ ---     │
│ str   ┆ f64     ┆ f64     │
╞═══════╪═════════╪═════════╡
│ a     ┆ 1.0     ┆ 1.0     │
│ b     ┆ 2.0     ┆ 2.0     │
│ c     ┆ inf     ┆ null    │
└───────┴─────────┴─────────┘

I could solve this by looping over the column names instead, but I was surprised at the above behaviour. Is there a way to make the above expression work?

import numpy as np
df = pl.DataFrame({"label":["a", "b", "c"], "ratio_a":[1.,2.,np.inf]})

for col in df.columns:
    if col.startswith("ratio_"):
        df = df.with_columns(
            pl.when(pl.col(col).is_infinite())
            .then(None)
            .otherwise(pl.col(col)
            ).alias(col)
        )
df

Upvotes: 5

Views: 4059

Answers (2)

jqurious
jqurious

Reputation: 21590

Expr.replace() has since been added, which may also be an option.

df.with_columns(
    pl.col("^ratio\w+$").replace([float("inf"), float("-inf")], None)
)
shape: (3, 5)
┌───────┬─────────┬─────────┬─────────┬─────────┐
│ label ┆ ratio_a ┆ ratio_b ┆ ratio_c ┆ ratio_d │
│ ---   ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str   ┆ f64     ┆ f64     ┆ f64     ┆ f64     │
╞═══════╪═════════╪═════════╪═════════╪═════════╡
│ a     ┆ 1.0     ┆ 10.0    ┆ null    ┆ -1.0    │
│ b     ┆ 2.0     ┆ null    ┆ 200.0   ┆ -2.0    │
│ c     ┆ null    ┆ 20.0    ┆ 300.0   ┆ null    │
└───────┴─────────┴─────────┴─────────┴─────────┘

Upvotes: 2

user18559875
user18559875

Reputation:

You were on the right track. You can use the .name.keep() expression.

Let's expand your example.

import polars as pl
import numpy as np

df = pl.DataFrame({"label": ["a", "b", "c"],
                   "ratio_a": [1.0, 2.0, np.inf],
                   "ratio_b": [10.0, np.inf, 20.0],
                   "ratio_c": [np.inf, 200, 300],
                   "ratio_d": [-1.0, -2.0, np.inf],
                   })
df

shape: (3, 5)
┌───────┬─────────┬─────────┬─────────┬─────────┐
│ label ┆ ratio_a ┆ ratio_b ┆ ratio_c ┆ ratio_d │
│ ---   ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str   ┆ f64     ┆ f64     ┆ f64     ┆ f64     │
╞═══════╪═════════╪═════════╪═════════╪═════════╡
│ a     ┆ 1.0     ┆ 10.0    ┆ inf     ┆ -1.0    │
│ b     ┆ 2.0     ┆ inf     ┆ 200.0   ┆ -2.0    │
│ c     ┆ inf     ┆ 20.0    ┆ 300.0   ┆ inf     │
└───────┴─────────┴─────────┴─────────┴─────────┘

To your expression, just append .name.keep().

df.with_columns(
    pl.when(pl.col(r"^ratio_\w+$").is_infinite())
    .then(None)
    .otherwise(pl.col(r"^ratio_\w+$"))
    .name.keep()
)
shape: (3, 5)
┌───────┬─────────┬─────────┬─────────┬─────────┐
│ label ┆ ratio_a ┆ ratio_b ┆ ratio_c ┆ ratio_d │
│ ---   ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ str   ┆ f64     ┆ f64     ┆ f64     ┆ f64     │
╞═══════╪═════════╪═════════╪═════════╪═════════╡
│ a     ┆ 1.0     ┆ 10.0    ┆ null    ┆ -1.0    │
│ b     ┆ 2.0     ┆ null    ┆ 200.0   ┆ -2.0    │
│ c     ┆ null    ┆ 20.0    ┆ 300.0   ┆ null    │
└───────┴─────────┴─────────┴─────────┴─────────┘

As an aside, you should probably prepend r to your regex string: r"^ratio_\w+$" to avoid a deprecation warning in Python.

Upvotes: 8

Related Questions