Damon C. Roberts

Reputation: 326

Map user-defined function on multiple polars columns

I am doing a bit of data munging on a polars.DataFrame. I could write the same expression several times, but I would like to cut down on the repetition, so I was thinking of creating a user-defined function that just plugs in the column names.

I know that polars tends to discourage user-defined functions (and for good reasons), but it feels tedious to write out the same expression over and over again with only the column changed.

So let's say that I have a polars dataframe like this:

import polars as pl
df = pl.DataFrame({
    'a':['Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree'],
    'b':['Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree'],
    'c':['Agree', 'Strongly Agree', 'Strongly Disagree', 'Disagree']
})

I could use a when-then-otherwise expression to convert one of these columns, say a, to numeric:

df_clean = df.with_columns(
    pl.when(
        pl.col('a') == 'Strongly Disagree'
    ).then(
        pl.lit(1)
    ).when(
        pl.col('a') == 'Disagree'
    ).then(
        pl.lit(2)
    ).when(
        pl.col('a') == 'Agree'
    ).then(
        pl.lit(3)
    ).when(
        pl.col('a') == 'Strongly Agree'
    ).then(
        pl.lit(4)
    )
)

But I don't want to write this out two more times.

So I was thinking I could write a function and map it over a, b, and c, but it seems like that wouldn't work.

Does anyone have advice on the most efficient way to do this?

Upvotes: 1

Views: 2953

Answers (2)

Wayoshi

Reputation: 2903

See replace, which can be broadcast to whatever columns you want, and does the job succinctly:

df_clean = df.with_columns(
    pl.all().replace(
        {'Strongly Disagree': 1, 'Disagree': 2, 'Agree': 3, 'Strongly Agree': 4}
    )
)

shape: (4, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 4   ┆ 3   │
│ 2   ┆ 3   ┆ 4   │
│ 3   ┆ 2   ┆ 1   │
│ 4   ┆ 1   ┆ 2   │
└─────┴─────┴─────┘

If you want to rename the columns like in your follow-up answer, you certainly can with a similar approach:

columns_to_convert = ['a', 'b', 'c']
new_column_names = ['x', 'y', 'z']
md = {'Much worse' : -3, ...} # whatever values here

df_clean = df.with_columns(
    pl.col(old_col).replace(md).alias(new_col)
    for old_col, new_col in zip(columns_to_convert, new_column_names)
)

Upvotes: 5

Damon C. Roberts

Reputation: 326

Think I figured it out!

def seven_likert(df, columns, new_columns):
    """
    Convert string values to numerics for a seven-point Likert scale
    
    Args:
    df (pl.DataFrame): The Polars DataFrame.
    columns (list of str): List of column names to convert.
    new_columns (list of str): List of new column names.
    
    Returns:
    pl.DataFrame: A new DataFrame with the specified columns converted to numeric values.
    """ 
    assert len(columns) == len(new_columns), "Input lists must have the same length"

    for column, new_column in zip(columns, new_columns):
        df = df.with_columns(
            pl.when(
                pl.col(column) == 'Much worse'
            ).then(
                pl.lit(-3)
            ).when(
                pl.col(column) == 'Worse'
            ).then(
                pl.lit(-2)
            ).when(
                pl.col(column) == 'Slightly worse'
            ).then(
                pl.lit(-1)
            ).when(
                pl.col(column) == 'Neither better nor worse'
            ).then(
                pl.lit(0)
            ).when(
                pl.col(column) == 'Slightly better'
            ).then(
                pl.lit(1)
            ).when(
                pl.col(column) == 'Better'
            ).then(
                pl.lit(2)
            ).when(
                pl.col(column) == 'Much better'
            ).then(
                pl.lit(3)
            ).alias(
                new_column
            )
        )
    
    return df

columns_to_convert = ['a', 'b', 'c']
new_column_names = ['x', 'y', 'z']
df_clean = seven_likert(df, columns_to_convert, new_column_names)

Upvotes: 1
