Reputation: 41
I am currently trying to create a new column within a polars dataframe (df). Within my df, there are many many rows, and within this new column I only want my existing list to populate wherever certain conditions are met.
I know that in pandas, I could do something like the following (where 'output' is my new column and assuming list is the same length as the filtered df) :
df = pd.DataFrame({
'cond1': [-1, 0, 1, 2, 2, 3, 4, 5],
'cond2': ['No', 'No', 'No', 'Yes','Yes', 'Yes', 'Yes', 'Yes']
})
conditions = ((df['cond1']>=1) & (df['cond2'] == 'Yes'))
list_new = [1,2,3,4,5]
df.loc[conditions, 'output'] = list_new
How would I do this same thing but within polars? I want the total number of rows to stay the same, I just want to update the new column wherever the conditions are met
I tried a few variations of the following but nothing has seemed to work:
df_polars = df_polars.with_columns(pl.when(conditions).then(list_new).alias('output')
Upvotes: 4
Views: 430
Reputation: 21580
You could use .cum_sum()
to generate the indexes for .when
the conditions are true.
condition = (pl.col.cond1 >= 1) & (pl.col.cond2 == 'Yes')
df.with_columns(idx = pl.when(condition).then(condition.cum_sum()) - 1)
shape: (8, 3)
┌───────┬───────┬──────┐
│ cond1 ┆ cond2 ┆ idx │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 │
╞═══════╪═══════╪══════╡
│ -1 ┆ No ┆ null │
│ 0 ┆ No ┆ null │
│ 1 ┆ No ┆ null │
│ 2 ┆ Yes ┆ 0 │ # <-
│ 2 ┆ Yes ┆ 1 │ # <-
│ 3 ┆ Yes ┆ 2 │ # <-
│ 4 ┆ Yes ┆ 3 │ # <-
│ 5 ┆ Yes ┆ 4 │ # <-
└───────┴───────┴──────┘
Which can be given to .gather()
to get the corresponding values.
condition = (pl.col.cond1 >= 1) & (pl.col.cond2 == 'Yes')
values = pl.lit(pl.Series('output', [10, 20, 30, 40, 50]))
df.with_columns(
values.gather(
pl.when(condition).then(condition.cum_sum()) - 1
)
)
shape: (8, 3)
┌───────┬───────┬────────┐
│ cond1 ┆ cond2 ┆ output │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═══════╪═══════╪════════╡
│ -1 ┆ No ┆ null │
│ 0 ┆ No ┆ null │
│ 1 ┆ No ┆ null │
│ 2 ┆ Yes ┆ 10 │
│ 2 ┆ Yes ┆ 20 │
│ 3 ┆ Yes ┆ 30 │
│ 4 ┆ Yes ┆ 40 │
│ 5 ┆ Yes ┆ 50 │
└───────┴───────┴────────┘
Upvotes: 0
Reputation: 13427
Can't avoid the use of with_row_index
but one can also use DataFrame.update
here
import polars as pl
from polars import col
df = pl.DataFrame({
'cond1': [-1, 0, 1, 2, 2, 3, 4, 5],
'cond2': ['No', 'No', 'No', 'Yes','Yes', 'Yes', 'Yes', 'Yes']
})
list_new = [1, 2, 3, 4, 5]
df = df.with_row_index()
subset = df.select(
col('index').filter(col('cond1') >= 1, col('cond2') == 'Yes'),
pl.Series('output', list_new)
)
print(
df.with_columns(output=None)
.update(subset, on='index', how='outer')
.drop('index')
)
# shape: (8, 3)
# ┌───────┬───────┬────────┐
# │ cond1 ┆ cond2 ┆ output │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ str ┆ i64 │
# ╞═══════╪═══════╪════════╡
# │ -1 ┆ No ┆ null │
# │ 0 ┆ No ┆ null │
# │ 1 ┆ No ┆ null │
# │ 2 ┆ Yes ┆ 1 │
# │ 2 ┆ Yes ┆ 2 │
# │ 3 ┆ Yes ┆ 3 │
# │ 4 ┆ Yes ┆ 4 │
# │ 5 ┆ Yes ┆ 5 │
# └───────┴───────┴────────┘
Upvotes: 2
Reputation: 117540
.with_row index()
to get row indexes..filter()
to get rows based on conditions
.Now we get DataFrame with proper row indexes and new data:
conditions = [pl.col('cond1') >= 1, pl.col('cond2') == 'Yes']
df_output = (
df.with_row_index()
.filter(conditions).select('index', output=pl.Series(list_new)
)
┌───────┬────────┐
│ index ┆ output │
│ --- ┆ --- │
│ u32 ┆ i64 │
╞═══════╪════════╡
│ 3 ┆ 1 │
│ 4 ┆ 2 │
│ 5 ┆ 3 │
│ 6 ┆ 4 │
│ 7 ┆ 5 │
└───────┴────────┘
.join()
to add column to original DataFrame:(
df.with_row_index()
.join(df_output, on='index', how='left')
.drop('index')
)
┌───────┬───────┬────────┐
│ cond1 ┆ cond2 ┆ output │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═══════╪═══════╪════════╡
│ -1 ┆ No ┆ null │
│ 0 ┆ No ┆ null │
│ 1 ┆ No ┆ null │
│ 2 ┆ Yes ┆ 1 │
│ 2 ┆ Yes ┆ 2 │
│ 3 ┆ Yes ┆ 3 │
│ 4 ┆ Yes ┆ 4 │
│ 5 ┆ Yes ┆ 5 │
└───────┴───────┴────────┘
Upvotes: 5