Reputation:
Given a Polars DataFrame, is there a way to replace a particular value by "null"? For example, if there's a sentinel value like "_UNKNOWN"
and I want to make it truly missing in the dataframe instead.
Upvotes: 3
Views: 11492
Reputation: 21580
Update:
Expr.replace()
has also since been added to Polars.
df.with_columns(pl.col(pl.String).replace("_UNKNOWN", None))
shape: (4, 3)
┌──────┬──────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════╪══════╪═════╡
│ a ┆ null ┆ 1 │
│ b ┆ d ┆ 2 │
│ null ┆ e ┆ 3 │
│ c ┆ f ┆ 4 │
└──────┴──────┴─────┘
You can use .when().then().otherwise()
pl.col(pl.String)
is used to select all "string columns".
df = pl.DataFrame({
"A": ["a", "b", "_UNKNOWN", "c"],
"B": ["_UNKNOWN", "d", "e", "f"],
"C": [1, 2, 3, 4]
})
df.with_columns(
pl.when(pl.col(pl.String) == "_UNKNOWN")
.then(None)
.otherwise(pl.col(pl.String)) # keep original value
.name.keep()
)
Upvotes: 8
Reputation: 1462
In addition to previous answers, if you have just imported dataset, you can specify what values will be null
Example data.csv
:
A,B
a,1
b,2
_UNKNOWN,more than 3
c,_UNKNOWN
when importing dataset:
pl.read_csv("data.csv",
null_values="_UNKNOWN",
# infer_schema_length=0, # optional (if you will get parsing error))
df
:
┌──────┬─────────────┐
│ A ┆ B │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪═════════════╡
│ a ┆ 1 │
│ b ┆ 2 │
│ null ┆ more than 3 │
│ c ┆ null │
└──────┴─────────────┘
Upvotes: 4
Reputation: 18691
This is really a tweak of @jqurious's answer.
When you do a when
and your condition isn't met then the default is null
so you can just do:
df.with_columns(
pl.when(pl.col(pl.String) != "_UNKNOWN")
.then(pl.col(pl.String)) # keep original value
.name.keep()
)
If you have multiple null conditions say null_strings=["_UNKNOWN", "UNK", "who_knows"]
then you can use a is_in
like this:
df.with_columns(
pl.when(~pl.col(pl.String).is_in(null_strings))
.then(pl.col(pl.String)) # keep original value
.name.keep()
)
Upvotes: 4
Reputation: 171
replace
is all you need!
print(
pl.DataFrame({
"A": ["a", "b", "_UNKNOWN", "c"],
"B": ["_UNKNOWN", "d", "e", "f"],
"C": [1, 2, 3, 4]})
.with_columns(pl.col(pl.String).replace({"_UNKNOWN": None}))
)
shape: (4, 3)
┌──────┬──────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════╪══════╪═════╡
│ a ┆ null ┆ 1 │
│ b ┆ d ┆ 2 │
│ null ┆ e ┆ 3 │
│ c ┆ f ┆ 4 │
└──────┴──────┴─────┘
Upvotes: 1