Reputation: 5552
Given df
A = pd.DataFrame([[1, 5, 2, 1, 2], [2, 4, 4, 1, 2], [3, 3, 1, 1, 2], [4, 2, 2, 3, 0],
[5, 1, 4, 3, -4], [1, 5, 2, 3, -20], [2, 4, 4, 2, 0], [3, 3, 1, 2, -1],
[4, 2, 2, 2, 0], [5, 1, 4, 2, -2]],
columns=['a', 'b', 'c', 'd', 'e'],
index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
How can I create a column 'f'
, that corresponds to the last value in column 'e'
before a change in value in column 'd'
, and holds that value until the next change in value in column 'd'
the output would be:
a b c d e f
1 1 5 2 1 2 nan
2 2 4 4 1 2 nan
3 3 3 1 1 2 nan
4 4 2 2 3 0 2
5 5 1 4 3 -4 2
6 1 5 2 3 -20 2
7 2 4 4 2 0 -20
8 3 3 1 2 -1 -20
9 4 2 2 2 0 -20
10 5 1 4 2 -2 -20
Edit: @Noobie
presented a solution that when applied in real data, it breaks down when there's a smaller than previous value in column 'd'
Upvotes: 3
Views: 68
Reputation: 353009
I think we should offer better native support for dealing with contiguous groups, but until then you can us the compare-cumsum-groupby pattern:
g = (A["d"] != A["d"].shift()).cumsum()
A["f"] = A["e"].groupby(g).last().shift().loc[g].values
which gives me
In [41]: A
Out[41]:
a b c d e f
1 1 5 2 1 2 NaN
2 2 4 4 1 2 NaN
3 3 3 1 1 2 NaN
4 4 2 2 2 0 2.0
5 5 1 4 2 -4 2.0
6 1 5 2 2 -20 2.0
7 2 4 4 3 0 -20.0
8 3 3 1 3 -1 -20.0
9 4 2 2 3 0 -20.0
10 5 1 4 3 -2 -20.0
This works because g
is a count corresponding to each contiguous group of d
values. Note that in this case, using the example you posted, g is the same as column "d", but that needn't be the case. Once we have g
, we can use it to group column e
:
In [55]: A["e"].groupby(g).last()
Out[55]:
d
1 2
2 -20
3 -2
Name: e, dtype: int64
and then
In [57]: A["e"].groupby(g).last().shift()
Out[57]:
d
1 NaN
2 2.0
3 -20.0
Name: e, dtype: float64
In [58]: A["e"].groupby(g).last().shift().loc[g]
Out[58]:
d
1 NaN
1 NaN
1 NaN
2 2.0
2 2.0
2 2.0
3 -20.0
3 -20.0
3 -20.0
3 -20.0
Name: e, dtype: float64
Upvotes: 4
Reputation: 19375
easy my friend. unleash the POWER OF PANDAS
!
A.sort_values(by = 'd', inplace = True)
A['lag'] = A.e.shift(1)
A['output'] = A.groupby('d').lag.transform(lambda x : x.iloc[0])
A
Out[57]:
a b c d e lag output
1 1 5 2 1 2 NaN NaN
2 2 4 4 1 2 2.0 NaN
3 3 3 1 1 2 2.0 NaN
4 4 2 2 2 0 2.0 2.0
5 5 1 4 2 -4 0.0 2.0
6 1 5 2 2 -20 -4.0 2.0
7 2 4 4 3 0 -20.0 -20.0
8 3 3 1 3 -1 0.0 -20.0
9 4 2 2 3 0 -1.0 -20.0
10 5 1 4 3 -2 0.0 -20.0
Upvotes: 1