Reputation: 89
Basically I want to create a new column which replicates the last different value of another column.
I tried this but doesn't work.
import pandas as pd
import numpy as np
column1= [1,2,2,2,3,3,3,3,3,3,2,2,2,2 ]
data = pd.DataFrame(column1, columns =['column1'])
data['column2'] = np.where(data['column1'] != data['column1'].shift(1), data['column1'].shift(1),data['column2'].shift(1))
Also tried this:
data['column2'] = [data['column2'].shift(1) if x != data['column1'].shift(1) else data['column1'].shift(1) for x in data['column1']]
Desired output is as follows:
column1,column2
1,
2,1
2,1
2,1
3,2
3,2
3,2
3,2
3,2
3,2
2,3
2,3
2,3
2,3
Oh and while I'm sure I can do this with a for loop, I'm interested in finding a way to get this done without a loop.
Thanks
Edit: This method gets closer, but I ned to run these two lines of code once at a time, so it's extremely impractical.
data.loc[data['column1'] != data['column1'].shift(1), 'column2'] = data['column1'].shift(1)
data.loc[data['column1'] == data['column1'].shift(1), 'column2'] = data['column2'].shift(1)
Upvotes: 1
Views: 69
Reputation: 350
This works for your specific example
data['column2'] = data.diff(1).apply(lambda r: data.loc[r.name - 1, 'column1'] if abs(r.column1) == 1 else None, axis=1).ffill()
How fast it is?
%%timeit
data.diff(1).apply(lambda r: data.loc[r.name - 1, 'column1'] if abs(r.column1) == 1 else None, axis=1).ffill()
1.41 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
s = data.column1.diff().ne(0).cumsum()
data.shift().groupby(s).column1.transform('first')
2.36 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 1
Reputation: 25259
try shift
and groupby on custom groupID s
and transform first
s = data.column1.diff().ne(0).cumsum()
data['column2'] = data.shift().groupby(s).column1.transform('first')
Out[374]:
column1 column2
0 1 NaN
1 2 1.0
2 2 1.0
3 2 1.0
4 3 2.0
5 3 2.0
6 3 2.0
7 3 2.0
8 3 2.0
9 3 2.0
10 2 3.0
11 2 3.0
12 2 3.0
13 2 3.0
Upvotes: 3
Reputation: 89
Extremely slow 'solution' I just found
data['column2'] =np.NaN
nacount=1
nacount2=0
while nacount != nacount2:
nacount = data['column2'].isna().sum()
data.loc[data['column1'] != data['column1'].shift(1), 'column2'] = data['column1'].shift(1)
data.loc[data['column1'] == data['column1'].shift(1), 'column2'] = data['column2'].shift(1)
nacount2 = data['column2'].isna().sum()
I'm not happy with it, excel does this formula faster.
Upvotes: 0