Reputation: 181

Pandas REPLACE equivalent in Python Polars

Is there an elegant way how to recode values in polars dataframe.

For example

1->0, 
2->0, 
3->1...

in Pandas it is simple like that:

df.replace([1,2,3,4,97,98,99],[0,0,1,1,2,2,2])

Upvotes: 16

Answers (6)

edesz

Reputation: 12406

This could also be done by

converting a mapping dictionary into a polars.DataFrame
LEFT JOINing the original data to the mapper DataFrame
fill any missing values (which the mapper dict does not account for) with the original values
drop the original column

Here is the code to show this

Data

df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
print(df)
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
│ 4   │
│ 5   │
└─────┘

Define the mapper dict

mapper = {1: 0, 2: 0, 3: 10, 4: 10}

Create a mapper DataFrame

df_mapper = pl.DataFrame([{"a": k, "values": v} for k, v in mapper.items()])
print(df_mapper)
shape: (4, 2)
┌─────┬────────┐
│ a   ┆ values │
│ --- ┆ ---    │
│ i64 ┆ i64    │
╞═════╪════════╡
│ 1   ┆ 0      │
│ 2   ┆ 0      │
│ 3   ┆ 10     │
│ 4   ┆ 10     │
└─────┴────────┘

Use LEFT JOIN and .fill_null() to map the values, and then drop the original column

df = (
    df
    # LEFT JOIN
    .join(df_mapper, on=["a"], how="left")
    # fill missing values in mapped column with values from original column
    .with_columns([pl.col("values").fill_null(pl.col("a"))])
    # drop original column and replace with mapped column
    .drop(["a"])
    .rename({"values": "a"})
)
print(df)
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 10  │
│ 10  │
│ 5   │
└─────┘

Upvotes: 2

ritchie46

Reputation: 14730

Edit 2024-07-09

Polars has dedicated replace and replace_strict expressions.

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5]
})

mapper = {
    1: 0,
    2: 0,
    3: 10,
    4: 10
}

df.select(
    pl.all().replace(mapper)
)

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 10  │
│ 10  │
│ 5   │
└─────┘

Before Edit

In polars you can build columnar if else statetements called if -> then -> otherwise expressions.

So let's say we have this DataFrame.

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5]
})

And we'd like to replace these with the following values:

from_ = [1, 2]
to_ = [99, 12]

We could write:

df.with_columns(
    pl.when(pl.col("a") == from_[0])
    .then(to_[0])
    .when(pl.col("a") == from_[1])
    .then(to_[1])
    .otherwise(pl.col("a")).alias("a")
)

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 99  │
│ 12  │
│ 3   │
│ 4   │
│ 5   │
└─────┘

Don't repeat yourself

Now, this becomes very tedious to write really fast, so we could write a function that generates these expressions for use, we are programmers aren't we!

So to replace with the values you have suggested, you could do:

from_ = [1,2,3,4,97,98,99]
to_ = [0,0,1,1,2,2,2]


def replace(column, from_, to_):
    # initiate the expression with `pl.when`
    branch = pl.when(pl.col(column) == from_[0]).then(to_[0])

    
    # for every value add a `when.then`
    for (from_value, to_value) in zip(from_, to_):
        branch = branch.when(pl.col(column) == from_value).then(to_value)

    # finish with an `otherwise`
    return branch.otherwise(pl.col(column)).alias(column)
    


df.with_columns(replace("a", from_, to_))

Which outputs:

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 1   │
│ 1   │
│ 5   │
└─────┘

Upvotes: 29

Hericks

Reputation: 10434

Update. November 2023

With the release of polars 0.19.16, map_dict was deprecated in favour of pl.Expr.replace.

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5]
})

mapping_dict = {
    1: 0,
    2: 0,
    3: 10,
    4: 10
}

df.with_columns(
    pl.all().replace(mapping_dict)
)

Output.

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 10  │
│ 10  │
│ 5   │
└─────┘

Note that the default behaviour has changed to keep any values not present in the mapping unchanged. To restore the previous default behaviour (and replace any values not present in the mapping with None), pass default=None as follows.

df.with_columns(
    pl.all().replace(mapping_dict, default=None)
)

Output.

shape: (5, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 0    │
│ 0    │
│ 10   │
│ 10   │
│ null │
└──────┘

Upvotes: 8

Radu

Reputation: 803

Can't use code snippet in comments, so I'll post this slight generalization as an answer.

In case the value to be mapped is missing from the mapping, this accepts a default value if provided, else it will act as if the mapping is the identity mapping.

import polars as pl
def apply_map(
    column: str, mapping: dict, default = None
) -> pl.Expr:
    branch = pl
    for key, value in mapping.items():
        branch = branch.when(pl.col(column) == key).then(value)
    default = pl.lit(default) if default is not None else pl.col(column)
    return branch.otherwise(default).alias(column)

Upvotes: 0

NedDasty

Reputation: 372

You can also use apply with a dict, as long as you specify an exhaustive mapping for each from_ option:

df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})

from_ = [1, 2, 3, 4, 5]
to_ = [99, 12, 4, 18, 64]
my_map = dict(zip(from_, to_))

df.select(pl.col("a").apply(lambda x: my_map[x]))

which outputs:

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 99  │
├╌╌╌╌╌┤
│ 12  │
├╌╌╌╌╌┤
│ 4   │
├╌╌╌╌╌┤
│ 18  │
├╌╌╌╌╌┤
│ 64  │
└─────┘

It'll be slower than ritchie46's answer but it's quite a bit simpler.

Upvotes: 0

supersick

Reputation: 351

Just in case you like the pandas docstrings as well and want to place it as a utils function somewhere in your repo

def replace(column: str, mapping: dict) -> pl.internals.expr.Expr:
    """
    Create a polars expression that replaces a columns values.

    Parameters
    ----------
    column : str
        Column name on which values should be replaced.
    mapping : dict
        Can be used to specify different replacement values for different existing values. For example,
        ``{'a': 'b', 'y': 'z'}`` replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. Values not mentioned in ``mapping``
        will stay the same.

    Returns
    -------
    pl.internals.expr.Expr
        Expression that contains instructions to replace values in ``column`` according to ``mapping``.

    Raises
    ------
    Exception
        * If ``mapping`` is empty.
    TypeError
        * If ``column`` is not ``str``.
        * If ``mapping`` is not ``dict``.
    polars.exceptions.PanicException
        * When ``mapping`` has keys or values that are not mappable to arrows format. Only catchable via BaseException.
          See also https://pola-rs.github.io/polars-book/user-guide/datatypes.html.

    Examples
    --------
    >>> import polars as pl
    >>> df = pl.DataFrame({'fruit':['banana', 'apple', 'pie']})
    >>> df
    shape: (3, 1)
    ┌────────┐
    │ fruit  │
    │ ---    │
    │ str    │
    ╞════════╡
    │ banana │
    ├╌╌╌╌╌╌╌╌┤
    │ apple  │
    ├╌╌╌╌╌╌╌╌┤
    │ apple  │
    └────────┘
    >>> df.with_column(replace(column='fruit', mapping={'apple': 'pomegranate'}))
    shape: (3, 1)
    ┌─────────────┐
    │ fruit       │
    │ ---         │
    │ str         │
    ╞═════════════╡
    │ banana      │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ pomegranate │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ pomegranate │
    └─────────────┘

    """
    if not mapping:
        raise Exception("Mapping can't be empty")
    elif not isinstance(mapping, dict):
        TypeError(f"mapping must be of type dict, but is type: {type(mapping)}")
    if not isinstance(column, str):
        raise TypeError(f"column must be of type str, but is type: {type(column)}")

    branch = pl.when(pl.col(column) == list(mapping.keys())[0]).then(
        list(mapping.values())[0]
    )

    for from_value, to_value in mapping.items():
        branch = branch.when(pl.col(column) == from_value).then(to_value)

    return branch.otherwise(pl.col(column)).alias(column)

Upvotes: 3