user554319
user554319

Reputation:

Replace value by null in Polars

Given a Polars DataFrame, is there a way to replace a particular value by "null"? For example, if there's a sentinel value like "_UNKNOWN" and I want to make it truly missing in the dataframe instead.

Upvotes: 3

Views: 11492

Answers (4)

jqurious
jqurious

Reputation: 21580

Update: Expr.replace() has also since been added to Polars.

df.with_columns(pl.col(pl.String).replace("_UNKNOWN", None))
shape: (4, 3)
┌──────┬──────┬─────┐
│ A    ┆ B    ┆ C   │
│ ---  ┆ ---  ┆ --- │
│ str  ┆ str  ┆ i64 │
╞══════╪══════╪═════╡
│ a    ┆ null ┆ 1   │
│ b    ┆ d    ┆ 2   │
│ null ┆ e    ┆ 3   │
│ c    ┆ f    ┆ 4   │
└──────┴──────┴─────┘

You can use .when().then().otherwise()

pl.col(pl.String) is used to select all "string columns".

df = pl.DataFrame({
   "A": ["a", "b", "_UNKNOWN", "c"], 
   "B": ["_UNKNOWN", "d", "e", "f"], 
   "C": [1, 2, 3, 4]
})

df.with_columns(
   pl.when(pl.col(pl.String) == "_UNKNOWN")
     .then(None)
     .otherwise(pl.col(pl.String)) # keep original value
     .name.keep()
)

Upvotes: 8

glebcom
glebcom

Reputation: 1462

In addition to previous answers, if you have just imported dataset, you can specify what values will be null

Example data.csv:

A,B
a,1
b,2
_UNKNOWN,more than 3
c,_UNKNOWN

when importing dataset:

pl.read_csv("data.csv",
            null_values="_UNKNOWN",
            # infer_schema_length=0,  # optional (if you will get parsing error))

df:

┌──────┬─────────────┐
│ A    ┆ B           │
│ ---  ┆ ---         │
│ str  ┆ str         │
╞══════╪═════════════╡
│ a    ┆ 1           │
│ b    ┆ 2           │
│ null ┆ more than 3 │
│ c    ┆ null        │
└──────┴─────────────┘

Upvotes: 4

Dean MacGregor
Dean MacGregor

Reputation: 18691

This is really a tweak of @jqurious's answer.

When you do a when and your condition isn't met then the default is null so you can just do:

df.with_columns(
    pl.when(pl.col(pl.String) != "_UNKNOWN")
        .then(pl.col(pl.String)) # keep original value
        .name.keep()
)

If you have multiple null conditions say null_strings=["_UNKNOWN", "UNK", "who_knows"] then you can use a is_in like this:

df.with_columns(
    pl.when(~pl.col(pl.String).is_in(null_strings))
        .then(pl.col(pl.String)) # keep original value
        .name.keep()
)

Upvotes: 4

asterix
asterix

Reputation: 171

replace is all you need!

print(
    pl.DataFrame({
       "A": ["a", "b", "_UNKNOWN", "c"], 
       "B": ["_UNKNOWN", "d", "e", "f"], 
       "C": [1, 2, 3, 4]})
    .with_columns(pl.col(pl.String).replace({"_UNKNOWN": None}))
)
shape: (4, 3)
┌──────┬──────┬─────┐
│ A    ┆ B    ┆ C   │
│ ---  ┆ ---  ┆ --- │
│ str  ┆ str  ┆ i64 │
╞══════╪══════╪═════╡
│ a    ┆ null ┆ 1   │
│ b    ┆ d    ┆ 2   │
│ null ┆ e    ┆ 3   │
│ c    ┆ f    ┆ 4   │
└──────┴──────┴─────┘

Upvotes: 1

Related Questions