Reputation: 874
I have the following DataFrame:
df = pd.DataFrame({'series1':['A','A','A','A','B','B','B','C','C','C','C'],
'series2':[0,1,10,99,-9,9,0,10,20,10,10]})
series1 series2
0 A 0.0
1 A 1.0
2 A 10.0
3 A 99.0
4 B -9.0
5 B 9.0
6 B 0.0
7 C 10.0
8 C 20.0
9 C 10.0
10 C 10.0
What I want:
df2 = pd.DataFrame({'series1':['A','A','A','A','B','B','B','C','C','C','C'],
'series2':[np.nan,1,10,99,np.nan,9,0,np.nan,20,10,10]})
series1 series2
0 A NaN
1 A 1.0
2 A 10.0
3 A 99.0
4 B NaN
5 B 9.0
6 B 0.0
7 C NaN
8 C 20.0
9 C 10.0
10 C 10.0
I have a feeling this might be able to be done by using Pandas .groupby function:
df.groupby('series1').first()
series2
series1
A 0
B -9
C 10
which gives me the observations I want to convert to NaNs, but I can't figure out a way to easily replace this in the original DataFrame.
This is just a simple example, the actual dataframe I'm working for has >8,000,000 observations.
Upvotes: 1
Views: 121
Reputation: 323236
Or you can using head
, first
or nth
all give back same result by the index slicing.
df.loc[df.groupby('series1',as_index=False).head(1).index,'series2'] = np.nan
#df.loc[df.groupby('series1',as_index=False).first().index,'series2'] = np.nan
#df.loc[df.groupby('series1',as_index=False).nth(1).index,'series2'] = np.nan
Upvotes: 2
Reputation: 214957
Another option by shifting the column:
df['series2'] = df.groupby('series1').series2.transform(lambda x: x.shift(-1).shift())
df
# series1 series2
#0 A NaN
#1 A 1.0
#2 A 10.0
#3 A 99.0
#4 B NaN
#5 B 9.0
#6 B 0.0
#7 C NaN
#8 C 20.0
#9 C 10.0
#10 C 10.0
Upvotes: 2
Reputation: 353059
There's probably a slicker way to do this, but the first element in each group is the 0th element in that group, and cumcount
numbers the elements within each group. So:
In [19]: df.loc[df.groupby('series1').cumcount() == 0, 'series2'] = np.nan
In [20]: df
Out[20]:
series1 series2
0 A NaN
1 A 1.0
2 A 10.0
3 A 99.0
4 B NaN
5 B 9.0
6 B 0.0
7 C NaN
8 C 20.0
9 C 10.0
10 C 10.0
Upvotes: 4
Reputation: 57033
You want to locate discontinuities in series1
by shifting it down and comparing to itself:
df.loc[df['series1'].shift() != df['series1'], 'series2'] = np.nan
Upvotes: 2