Reputation: 181
Is there an elegant way how to recode values in polars dataframe.
For example
1->0,
2->0,
3->1...
in Pandas it is simple like that:
df.replace([1,2,3,4,97,98,99],[0,0,1,1,2,2,2])
Upvotes: 16
Views: 17674
Reputation: 12406
This could also be done by
polars.DataFrame
LEFT JOIN
ing the original data to the mapper DataFrame
Here is the code to show this
Data
df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
print(df)
shape: (5, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
└─────┘
Define the mapper dict
mapper = {1: 0, 2: 0, 3: 10, 4: 10}
Create a mapper DataFrame
df_mapper = pl.DataFrame([{"a": k, "values": v} for k, v in mapper.items()])
print(df_mapper)
shape: (4, 2)
┌─────┬────────┐
│ a ┆ values │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪════════╡
│ 1 ┆ 0 │
│ 2 ┆ 0 │
│ 3 ┆ 10 │
│ 4 ┆ 10 │
└─────┴────────┘
Use LEFT JOIN
and .fill_null()
to map the values, and then drop the original column
df = (
df
# LEFT JOIN
.join(df_mapper, on=["a"], how="left")
# fill missing values in mapped column with values from original column
.with_columns([pl.col("values").fill_null(pl.col("a"))])
# drop original column and replace with mapped column
.drop(["a"])
.rename({"values": "a"})
)
print(df)
shape: (5, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
│ 0 │
│ 10 │
│ 10 │
│ 5 │
└─────┘
Upvotes: 2
Reputation: 14730
Polars has dedicated replace
and replace_strict
expressions.
df = pl.DataFrame({
"a": [1, 2, 3, 4, 5]
})
mapper = {
1: 0,
2: 0,
3: 10,
4: 10
}
df.select(
pl.all().replace(mapper)
)
shape: (5, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
│ 0 │
│ 10 │
│ 10 │
│ 5 │
└─────┘
In polars you can build columnar if else statetements
called if -> then -> otherwise
expressions.
So let's say we have this DataFrame
.
df = pl.DataFrame({
"a": [1, 2, 3, 4, 5]
})
And we'd like to replace these with the following values:
from_ = [1, 2]
to_ = [99, 12]
We could write:
df.with_columns(
pl.when(pl.col("a") == from_[0])
.then(to_[0])
.when(pl.col("a") == from_[1])
.then(to_[1])
.otherwise(pl.col("a")).alias("a")
)
shape: (5, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 99 │
│ 12 │
│ 3 │
│ 4 │
│ 5 │
└─────┘
Now, this becomes very tedious to write really fast, so we could write a function that generates these expressions for use, we are programmers aren't we!
So to replace with the values you have suggested, you could do:
from_ = [1,2,3,4,97,98,99]
to_ = [0,0,1,1,2,2,2]
def replace(column, from_, to_):
# initiate the expression with `pl.when`
branch = pl.when(pl.col(column) == from_[0]).then(to_[0])
# for every value add a `when.then`
for (from_value, to_value) in zip(from_, to_):
branch = branch.when(pl.col(column) == from_value).then(to_value)
# finish with an `otherwise`
return branch.otherwise(pl.col(column)).alias(column)
df.with_columns(replace("a", from_, to_))
Which outputs:
shape: (5, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
│ 0 │
│ 1 │
│ 1 │
│ 5 │
└─────┘
Upvotes: 29
Reputation: 10434
With the release of polars 0.19.16, map_dict
was deprecated in favour of pl.Expr.replace
.
import polars as pl
df = pl.DataFrame({
"a": [1, 2, 3, 4, 5]
})
mapping_dict = {
1: 0,
2: 0,
3: 10,
4: 10
}
df.with_columns(
pl.all().replace(mapping_dict)
)
Output.
shape: (5, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
│ 0 │
│ 10 │
│ 10 │
│ 5 │
└─────┘
Note that the default behaviour has changed to keep any values not present in the mapping unchanged. To restore the previous default behaviour (and replace any values not present in the mapping with None
), pass default=None
as follows.
df.with_columns(
pl.all().replace(mapping_dict, default=None)
)
Output.
shape: (5, 1)
┌──────┐
│ a │
│ --- │
│ i64 │
╞══════╡
│ 0 │
│ 0 │
│ 10 │
│ 10 │
│ null │
└──────┘
Upvotes: 8
Reputation: 803
Can't use code snippet in comments, so I'll post this slight generalization as an answer.
In case the value to be mapped is missing from the mapping, this accepts a default value if provided, else it will act as if the mapping is the identity mapping.
import polars as pl
def apply_map(
column: str, mapping: dict, default = None
) -> pl.Expr:
branch = pl
for key, value in mapping.items():
branch = branch.when(pl.col(column) == key).then(value)
default = pl.lit(default) if default is not None else pl.col(column)
return branch.otherwise(default).alias(column)
Upvotes: 0
Reputation: 372
You can also use apply
with a dict
, as long as you specify an exhaustive mapping for each from_
option:
df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
from_ = [1, 2, 3, 4, 5]
to_ = [99, 12, 4, 18, 64]
my_map = dict(zip(from_, to_))
df.select(pl.col("a").apply(lambda x: my_map[x]))
which outputs:
shape: (5, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 99 │
├╌╌╌╌╌┤
│ 12 │
├╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌┤
│ 18 │
├╌╌╌╌╌┤
│ 64 │
└─────┘
It'll be slower than ritchie46's answer but it's quite a bit simpler.
Upvotes: 0
Reputation: 351
Just in case you like the pandas docstrings as well and want to place it as a utils function somewhere in your repo
def replace(column: str, mapping: dict) -> pl.internals.expr.Expr:
"""
Create a polars expression that replaces a columns values.
Parameters
----------
column : str
Column name on which values should be replaced.
mapping : dict
Can be used to specify different replacement values for different existing values. For example,
``{'a': 'b', 'y': 'z'}`` replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. Values not mentioned in ``mapping``
will stay the same.
Returns
-------
pl.internals.expr.Expr
Expression that contains instructions to replace values in ``column`` according to ``mapping``.
Raises
------
Exception
* If ``mapping`` is empty.
TypeError
* If ``column`` is not ``str``.
* If ``mapping`` is not ``dict``.
polars.exceptions.PanicException
* When ``mapping`` has keys or values that are not mappable to arrows format. Only catchable via BaseException.
See also https://pola-rs.github.io/polars-book/user-guide/datatypes.html.
Examples
--------
>>> import polars as pl
>>> df = pl.DataFrame({'fruit':['banana', 'apple', 'pie']})
>>> df
shape: (3, 1)
┌────────┐
│ fruit │
│ --- │
│ str │
╞════════╡
│ banana │
├╌╌╌╌╌╌╌╌┤
│ apple │
├╌╌╌╌╌╌╌╌┤
│ apple │
└────────┘
>>> df.with_column(replace(column='fruit', mapping={'apple': 'pomegranate'}))
shape: (3, 1)
┌─────────────┐
│ fruit │
│ --- │
│ str │
╞═════════════╡
│ banana │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pomegranate │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ pomegranate │
└─────────────┘
"""
if not mapping:
raise Exception("Mapping can't be empty")
elif not isinstance(mapping, dict):
TypeError(f"mapping must be of type dict, but is type: {type(mapping)}")
if not isinstance(column, str):
raise TypeError(f"column must be of type str, but is type: {type(column)}")
branch = pl.when(pl.col(column) == list(mapping.keys())[0]).then(
list(mapping.values())[0]
)
for from_value, to_value in mapping.items():
branch = branch.when(pl.col(column) == from_value).then(to_value)
return branch.otherwise(pl.col(column)).alias(column)
Upvotes: 3