Reputation: 326
I am doing a bit of data munging on a polars.DataFrame. I could write the same expression several times, but I would ideally like to cut down on the repetition, so I was thinking of creating a user-defined function that just plugs in the column names.
I know that Polars tends to discourage user-defined functions (and for good reasons), but it feels tedious to write out the same expression over and over again with only the column name changed.
So let's say that I have a polars dataframe like this:
import polars as pl

df = pl.DataFrame({
    'a': ['Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree'],
    'b': ['Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree'],
    'c': ['Agree', 'Strongly Agree', 'Strongly Disagree', 'Disagree']
})
I could use a when-then-otherwise expression to convert these three columns to numeric, shown here for column a:
df_clean = df.with_columns(
    pl.when(
        pl.col('a') == 'Strongly Disagree'
    ).then(
        pl.lit(1)
    ).when(
        pl.col('a') == 'Disagree'
    ).then(
        pl.lit(2)
    ).when(
        pl.col('a') == 'Agree'
    ).then(
        pl.lit(3)
    ).when(
        pl.col('a') == 'Strongly Agree'
    ).then(
        pl.lit(4)
    )
)
But I don't want to write this out two more times.
So I was thinking I could write a function and then map it over a, b, and c, but this seems like it wouldn't work.
Anyone have any advice for the most efficient way to do this?
Upvotes: 1
Views: 2953
Reputation: 2903
See replace, which can be broadcast to whatever columns you want and does the job succinctly:
df_clean = df.with_columns(
    pl.all().replace(
        {'Strongly Disagree': 1, 'Disagree': 2, 'Agree': 3, 'Strongly Agree': 4}
    )
)
shape: (4, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 4 ┆ 3 │
│ 2 ┆ 3 ┆ 4 │
│ 3 ┆ 2 ┆ 1 │
│ 4 ┆ 1 ┆ 2 │
└─────┴─────┴─────┘
If you want to rename the columns like in your follow-up answer, you certainly can with a similar approach:
columns_to_convert = ['a', 'b', 'c']
new_column_names = ['x', 'y', 'z']
md = {'Much worse' : -3, ...} # whatever values here
df_clean = df.with_columns(
    pl.col(old_col).replace(md).alias(new_col)
    for old_col, new_col in zip(columns_to_convert, new_column_names)
)
Upvotes: 5
Reputation: 326
Think I figured it out!
def seven_likert(df, columns, new_columns):
    """
    Convert string values to numerics for seven-item Likert

    Args:
        df (pl.DataFrame): The Polars DataFrame.
        columns (list of str): List of column names to convert.
        new_columns (list of str): List of new column names.

    Returns:
        pl.DataFrame: A new DataFrame with the specified columns converted to numeric values.
    """
    assert len(columns) == len(new_columns), "Input lists must have the same length"
    for column, new_column in zip(columns, new_columns):
        df = df.with_columns(
            pl.when(
                pl.col(column) == 'Much worse'
            ).then(
                pl.lit(-3)
            ).when(
                pl.col(column) == 'Worse'
            ).then(
                pl.lit(-2)
            ).when(
                pl.col(column) == 'Slightly worse'
            ).then(
                pl.lit(-1)
            ).when(
                pl.col(column) == 'Neither better nor worse'
            ).then(
                pl.lit(0)
            ).when(
                pl.col(column) == 'Slightly better'
            ).then(
                pl.lit(1)
            ).when(
                pl.col(column) == 'Better'
            ).then(
                pl.lit(2)
            ).when(
                pl.col(column) == 'Much better'
            ).then(
                pl.lit(3)
            ).alias(
                new_column
            )
        )
    return df

# Column names must be passed as strings
columns_to_convert = ['a', 'b', 'c']
new_column_names = ['x', 'y', 'z']
df_clean = seven_likert(df, columns_to_convert, new_column_names)
Upvotes: 1