Creating a new pandas column which takes values from a row, based on id

Question

How do I go about doing the following in a pandas DataFrame? I have a time series and want a new column that, for rows with the same id, holds the value from the previous epoch (see picture). I would like to do the following:

Create new column called previous_epoch_stage.
For each id:

fill previous_epoch_stage column with stage value from epoch-1 row.
but if epoch == 1 then fill previous_epoch_stage value with stage value from the same row.

smci · Accepted Answer

Generally you don't need to create extra columns if all you want is to access a lagged version of stage. You can simply do df.groupby('id') and then reference ['stage'].shift(1), which shifts within each group.

But if you really do want the column, here is a solution using shift() and fillna() (Boolean indexing would also work):

# Lag 'stage' by one row within each id; the first row of each id (epoch == 1) gets NaN
df['previous_epoch_stage'] = df.groupby('id')['stage'].shift(1)
# Now fill those NaNs from the 'stage' column of the same row
# (assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable)
df['previous_epoch_stage'] = df['previous_epoch_stage'].fillna(df['stage'])
# The NaNs coerced the ints to floats; cast back if you want ints again
df['previous_epoch_stage'] = df['previous_epoch_stage'].astype(int)
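
For a quick check, the same steps can be run end to end on a small frame. This is only a sketch: the sample values below are made up (the question's picture is not reproduced here), and the sort is an added assumption so that "one row back" really means "epoch - 1" within each id.

```python
import pandas as pd

# Hypothetical sample data; not taken from the question.
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "epoch": [1, 2, 3, 1, 2, 3],
    "stage": [0, 1, 2, 3, 3, 1],
})

# shift() is positional, so order the rows by epoch within each id first.
df = df.sort_values(["id", "epoch"]).reset_index(drop=True)

# Lag 'stage' within each id, then fill the epoch == 1 rows from the same row.
prev = df.groupby("id")["stage"].shift(1)
df["previous_epoch_stage"] = prev.fillna(df["stage"]).astype(int)

print(df)
```

With this data, previous_epoch_stage comes out as 0, 0, 1 for id 1 and 3, 3, 3 for id 2.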

Notes:

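we can shortcut "fill previous_epoch_stage column with stage value from epoch-1 row" to a plain shift(1) only if we assume/require that rows are sorted by increasing epoch, starting from 1, within each id; shift() is positional and never looks at the epoch values themselves. Under that assumption the first row of each group is the epoch == 1 row.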
there's also a useful helper function [df.where(cond, other, ...)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html) that does a vectorized if-else, and in this case other would need to be a function ('callable'), but it doesn't play nicely with groupby, so use Boolean indexing instead.
.shift() is neat because it lets you customize fill_value (the default is NaN) or specify arbitrary periods (positive or negative).
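
The Boolean-indexing route and the shift() options mentioned in these notes could look roughly like this. It is a sketch on made-up data that is assumed to be sorted by id and epoch already; the variable names and the -1 sentinel are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data, already sorted by id and epoch.
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "epoch": [1, 2, 3, 1, 2, 3],
    "stage": [0, 1, 2, 3, 3, 1],
})

# Boolean-indexing variant: lag first, then overwrite the epoch == 1 rows.
df["previous_epoch_stage"] = df.groupby("id")["stage"].shift(1)
first_epoch = df["epoch"] == 1
df.loc[first_epoch, "previous_epoch_stage"] = df.loc[first_epoch, "stage"]
df["previous_epoch_stage"] = df["previous_epoch_stage"].astype(int)

# fill_value replaces the NaN that shift() would otherwise introduce.
lag_with_sentinel = df.groupby("id")["stage"].shift(1, fill_value=-1)

# A negative period looks ahead instead of back (the next epoch's stage).
next_epoch_stage = df.groupby("id")["stage"].shift(-1)
```

The Boolean-indexing version keys on epoch == 1 explicitly, so it still expresses the stated rule even if some other row were to end up with a NaN after the shift.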
