hernanavella

Reputation: 5552

How to perform conditional dataframe operations?

Given df

A = pd.DataFrame([[1, 5, 2, 1, 2], [2, 4, 4, 1, 2], [3, 3, 1, 1, 2], [4, 2, 2, 3, 0], 
              [5, 1, 4, 3, -4], [1, 5, 2, 3, -20], [2, 4, 4, 2, 0], [3, 3, 1, 2, -1], 
              [4, 2, 2, 2, 0], [5, 1, 4, 2, -2]],
             columns=['a', 'b', 'c', 'd', 'e'],
             index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

How can I create a column 'f' that holds the last value of column 'e' before each change in value in column 'd', carrying that value forward until the next change in 'd'? The output would be:

    a   b   c   d   e    f
1   1   5   2   1   2    nan
2   2   4   4   1   2    nan
3   3   3   1   1   2    nan
4   4   2   2   3   0     2
5   5   1   4   3   -4    2
6   1   5   2   3   -20   2
7   2   4   4   2   0    -20
8   3   3   1   2   -1   -20
9   4   2   2   2   0    -20
10  5   1   4   2   -2   -20
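In plain terms: whenever d changes, remember the e value of the row just before the change, and repeat it until d changes again. A naive loop makes the requirement concrete (a reference sketch, not the fast way):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 5, 2, 1, 2], [2, 4, 4, 1, 2], [3, 3, 1, 1, 2], [4, 2, 2, 3, 0],
                  [5, 1, 4, 3, -4], [1, 5, 2, 3, -20], [2, 4, 4, 2, 0], [3, 3, 1, 2, -1],
                  [4, 2, 2, 2, 0], [5, 1, 4, 2, -2]],
                 columns=['a', 'b', 'c', 'd', 'e'],
                 index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

f = []
last_before_change = np.nan  # e value just before the current run of d started
for i in range(len(A)):
    if i > 0 and A['d'].iat[i] != A['d'].iat[i - 1]:
        last_before_change = A['e'].iat[i - 1]
    f.append(last_before_change)
A['f'] = f
print(A['f'].tolist())  # -> [nan, nan, nan, 2.0, 2.0, 2.0, -20.0, -20.0, -20.0, -20.0]
```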

Edit: @Noobie presented a solution that breaks down on real data whenever a value in column 'd' is smaller than the previous one.

Upvotes: 3

Views: 68

Answers (2)

DSM

Reputation: 353009

I think we should offer better native support for dealing with contiguous groups, but until then you can use the compare-cumsum-groupby pattern:

g = (A["d"] != A["d"].shift()).cumsum()
A["f"] = A["e"].groupby(g).last().shift().loc[g].values

which gives me

In [41]: A
Out[41]: 
    a  b  c  d   e     f
1   1  5  2  1   2   NaN
2   2  4  4  1   2   NaN
3   3  3  1  1   2   NaN
4   4  2  2  3   0   2.0
5   5  1  4  3  -4   2.0
6   1  5  2  3 -20   2.0
7   2  4  4  2   0 -20.0
8   3  3  1  2  -1 -20.0
9   4  2  2  2   0 -20.0
10  5  1  4  2  -2 -20.0

This works because g is a counter that labels each contiguous run of d values (here 1, 1, 1, 2, 2, 2, 3, 3, 3, 3). Note that g need not match column "d" itself; it only has to distinguish consecutive runs. Once we have g, we can use it to group column e:
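You can inspect g directly; a quick sketch with just the relevant columns of the posted data:

```python
import pandas as pd

A = pd.DataFrame(
    {"d": [1, 1, 1, 3, 3, 3, 2, 2, 2, 2],
     "e": [2, 2, 2, 0, -4, -20, 0, -1, 0, -2]},
    index=range(1, 11),
)

# g increments each time d differs from the previous row,
# so equal labels mark one contiguous run of d.
g = (A["d"] != A["d"].shift()).cumsum()
print(g.tolist())  # -> [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
```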

In [55]: A["e"].groupby(g).last()
Out[55]: 
d
1     2
2   -20
3    -2
Name: e, dtype: int64

and then

In [57]: A["e"].groupby(g).last().shift()
Out[57]: 
d
1     NaN
2     2.0
3   -20.0
Name: e, dtype: float64

In [58]: A["e"].groupby(g).last().shift().loc[g]
Out[58]: 
d
1     NaN
1     NaN
1     NaN
2     2.0
2     2.0
2     2.0
3   -20.0
3   -20.0
3   -20.0
3   -20.0
Name: e, dtype: float64
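The steps above can be put together as one runnable sketch (same pattern, with just the d and e columns of the question's data):

```python
import pandas as pd

A = pd.DataFrame(
    {"d": [1, 1, 1, 3, 3, 3, 2, 2, 2, 2],
     "e": [2, 2, 2, 0, -4, -20, 0, -1, 0, -2]},
    index=range(1, 11),
)

# Label contiguous runs of d, take the last e of each run,
# shift so each run sees the previous run's value, then
# broadcast back to the original rows via .loc[g].
g = (A["d"] != A["d"].shift()).cumsum()
A["f"] = A["e"].groupby(g).last().shift().loc[g].values

print(A["f"].tolist())  # -> [nan, nan, nan, 2.0, 2.0, 2.0, -20.0, -20.0, -20.0, -20.0]
```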

Upvotes: 4

ℕʘʘḆḽḘ

Reputation: 19375

Easy, my friend. Unleash the power of pandas!

A.sort_values(by='d', inplace=True)
A['lag'] = A['e'].shift(1)
A['output'] = A.groupby('d')['lag'].transform(lambda x: x.iloc[0])

A
Out[57]: 
    a  b  c  d   e   lag  output
1   1  5  2  1   2   NaN     NaN
2   2  4  4  1   2   2.0     NaN
3   3  3  1  1   2   2.0     NaN
4   4  2  2  2   0   2.0     2.0
5   5  1  4  2  -4   0.0     2.0
6   1  5  2  2 -20  -4.0     2.0
7   2  4  4  3   0 -20.0   -20.0
8   3  3  1  3  -1   0.0   -20.0
9   4  2  2  3   0  -1.0   -20.0
10  5  1  4  3  -2   0.0   -20.0
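As the question's edit notes, this approach assumes d never drops back to a smaller value: sorting by d merges non-adjacent runs into one group, so each group picks up the wrong boundary value. A sketch of the failure on the question's posted data (using a stable sort so the row order is reproducible):

```python
import pandas as pd

A = pd.DataFrame(
    {"d": [1, 1, 1, 3, 3, 3, 2, 2, 2, 2],
     "e": [2, 2, 2, 0, -4, -20, 0, -1, 0, -2]},
    index=range(1, 11),
)

B = A.sort_values(by="d", kind="stable")  # reorders rows: all d=1, then d=2, then d=3
B["lag"] = B["e"].shift(1)
B["output"] = B.groupby("d")["lag"].transform(lambda x: x.iloc[0])
B = B.sort_index()

# Rows 4-6 (d == 3) should get 2, and rows 7-10 (d == 2) should get -20,
# but sorting put the d == 2 rows before the d == 3 rows:
print(B.loc[4:6, "output"].tolist())   # -> [-2.0, -2.0, -2.0]
print(B.loc[7:10, "output"].tolist())  # -> [2.0, 2.0, 2.0, 2.0]
```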

Upvotes: 1
