measure_theory

Reputation: 874

Pandas: convert first value in group to np.nan

I have the following DataFrame:

df = pd.DataFrame({'series1':['A','A','A','A','B','B','B','C','C','C','C'],
                   'series2':[0,1,10,99,-9,9,0,10,20,10,10]})
   series1  series2
0        A        0
1        A        1
2        A       10
3        A       99
4        B       -9
5        B        9
6        B        0
7        C       10
8        C       20
9        C       10
10       C       10

What I want:

df2 = pd.DataFrame({'series1':['A','A','A','A','B','B','B','C','C','C','C'],
                   'series2':[np.nan,1,10,99,np.nan,9,0,np.nan,20,10,10]})
   series1  series2
0        A      NaN
1        A      1.0
2        A     10.0
3        A     99.0
4        B      NaN
5        B      9.0
6        B      0.0
7        C      NaN
8        C     20.0
9        C     10.0
10       C     10.0

I have a feeling this can be done with pandas' .groupby function:

df.groupby('series1').first()
         series2
series1         
A              0
B             -9
C             10

which gives me the observations I want to convert to NaN, but I can't figure out an easy way to put them back into the original DataFrame.

This is just a simple example; the actual DataFrame I'm working with has >8,000,000 observations.
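
For reference, the index of that groupby('series1').first() result is the group label, not the original row position, which is why it can't simply be assigned back. A minimal illustration of the mismatch, using the df defined above (not part of the original question):

first_vals = df.groupby('series1')['series2'].first()
first_vals.index   # the group labels 'A', 'B', 'C'
df.index           # the row positions 0..10
# the two indexes don't line up, so a plain assignment would misalign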

Upvotes: 1

Views: 121

Answers (4)

BENY

Reputation: 323236

Or you can use head or nth; both give back the rows with their original index, which you can then use for slicing:

    df.loc[df.groupby('series1',as_index=False).head(1).index,'series2'] = np.nan
    #df.loc[df.groupby('series1',as_index=False).nth(0).index,'series2'] = np.nan
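
A quick check of which index this selects, assuming the example df from the question (a small sketch, not part of the original answer):

    df.groupby('series1',as_index=False).head(1).index
    # the positional labels 0, 4 and 7, i.e. the first row of each group,
    # which is exactly what .loc needs on the left-hand side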

Upvotes: 2

akuiper

Reputation: 214957

Another option is to shift the column within each group:

df['series2'] = df.groupby('series1').series2.transform(lambda x: x.shift(-1).shift())

df
#  series1  series2
#0       A      NaN
#1       A      1.0
#2       A     10.0
#3       A     99.0
#4       B      NaN
#5       B      9.0
#6       B      0.0
#7       C      NaN
#8       C     20.0
#9       C     10.0
#10      C     10.0
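
A quick trace of one group makes the double shift clearer (a standalone sketch, not from the answer itself):

import pandas as pd

s = pd.Series([0, 1, 10, 99])   # group A's series2 values
s.shift(-1)                     # 1.0, 10.0, 99.0, NaN -- values moved up one slot
s.shift(-1).shift()             # NaN, 1.0, 10.0, 99.0 -- moved back down
# net effect: the first value becomes NaN and everything else is preserved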

Upvotes: 2

DSM

Reputation: 353059

There's probably a slicker way to do this, but the first element in each group is the 0th element in that group, and cumcount numbers the elements within each group. So:

In [19]: df.loc[df.groupby('series1').cumcount() == 0, 'series2'] = np.nan

In [20]: df
Out[20]: 
   series1  series2
0        A      NaN
1        A      1.0
2        A     10.0
3        A     99.0
4        B      NaN
5        B      9.0
6        B      0.0
7        C      NaN
8        C     20.0
9        C     10.0
10       C     10.0
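
For reference, this is what cumcount produces on the example frame (a small illustration, assuming the df from the question), so the == 0 mask picks out exactly the first row of each group:

df.groupby('series1').cumcount()
# 0     0
# 1     1
# 2     2
# 3     3
# 4     0
# 5     1
# 6     2
# 7     0
# 8     1
# 9     2
# 10    3
# dtype: int64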

Upvotes: 4

DYZ

Reputation: 57033

You want to locate discontinuities in series1 by shifting it down and comparing to itself:

df.loc[df['series1'].shift() != df['series1'], 'series2'] = np.nan
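
The boolean mask this builds is True exactly where a new group starts (a quick illustration against the example df; note it relies on each group's rows being contiguous, as they are here):

df['series1'].shift() != df['series1']
# True at positions 0, 4 and 7 (the first row of each group), False elsewhere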

Upvotes: 2
