Reputation: 5660
I want to ensure that the first value of val2 corresponding to each vintage is NaN. Currently two of them are already NaN, but I want the 0.53 to change to NaN as well.
import numpy as np
import pandas as pd

df = pd.DataFrame({
'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})
Here's what I've tried so far:
df.groupby('vintage').first().val2 #This gives the first non-NaN values, as shown below
vintage
2017-01-01 0.66
2017-02-01 0.53
2017-03-01 NaN
df.groupby('vintage').first().val2 = np.nan #This doesn't change anything
df.val2
0 NaN
1 0.66
2 0.81
3 0.53
4 0.62
5 NaN
Upvotes: 14
Views: 10883
Reputation: 2554
Timings:
import numpy as np
import pandas as pd

df = pd.DataFrame({
'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
'date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-02-01', '2017-03-01', '2017-03-01'],
'val1': [0.59, 0.68, 0.8, 0.54, 0.61, 0.6],
'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan]
})
def BENY(df):
    df.loc[df.groupby('vintage').vintage.cumcount() == 0, 'val2'] = np.nan

def EdChum(df):
    df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan

def knoble(df):
    def func(x):
        x['val2'].iloc[0] = np.nan
        return x
    df.groupby("vintage", group_keys=False).apply(func)
%timeit BENY(df)
406 µs ± 4.19 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit EdChum(df)
454 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit knoble(df)
1.07 ms ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
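One caveat (my observation, not part of the timings above): each of these functions mutates df in place, so after the first %timeit run the first values per group are already NaN, and later runs measure writes to an already-modified frame. A fairer sketch would hand each run a fresh copy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

def BENY(df):
    # set val2 to NaN on the first row of each vintage group
    df.loc[df.groupby('vintage').vintage.cumcount() == 0, 'val2'] = np.nan

# %timeit BENY(df.copy())  # each run then starts from the unmodified frame
```

This adds the cost of the copy to every measurement, but it is the same constant for all three candidates.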
Upvotes: 1
Reputation: 11
I think you could also write:
def h(x):
    x['val2'].iloc[0] = np.nan
    return x
df = df.groupby("vintage").apply(h)
Upvotes: 1
Reputation: 323226
Or create the position within each group with cumcount, pick the first one, and change val2 to np.nan:
df.loc[df.groupby('vintage').vintage.cumcount()==0,'val2']=np.nan
df
Out[154]:
date val1 val2 vintage
0 2017-01-01 0.59 NaN 2017-01-01
1 2017-02-01 0.68 0.66 2017-01-01
2 2017-03-01 0.80 0.81 2017-01-01
3 2017-02-01 0.54 NaN 2017-02-01
4 2017-03-01 0.61 0.62 2017-02-01
5 2017-03-01 0.60 NaN 2017-03-01
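To see why this works, here is the intermediate cumcount (the row's position within its group), checked against the same df:

```python
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
})

# cumcount numbers the rows within each vintage group, starting at 0
pos = df.groupby('vintage').vintage.cumcount()
print(pos.tolist())  # [0, 1, 2, 0, 1, 0]
```

Comparing pos == 0 therefore selects exactly the first row of each group, regardless of whether its val2 is NaN.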
Upvotes: 5
Reputation: 394041
You can't assign to the result of an aggregation; also, first skips existing NaN values. What you can do is call head(1), which returns the first row for each group, and pass those indices to loc to mask the original df and overwrite the column values:
In[91]:
df.loc[df.groupby('vintage')['val2'].head(1).index, 'val2'] = np.nan
df
Out[91]:
date val1 val2 vintage
0 2017-01-01 0.59 NaN 2017-01-01
1 2017-02-01 0.68 0.66 2017-01-01
2 2017-03-01 0.80 0.81 2017-01-01
3 2017-02-01 0.54 NaN 2017-02-01
4 2017-03-01 0.61 0.62 2017-02-01
5 2017-03-01 0.60 NaN 2017-03-01
Here you can see that head(1) returns the first row for each group:
In[94]:
df.groupby('vintage')['val2'].head(1)
Out[94]:
0 NaN
3 0.53
5 NaN
Name: val2, dtype: float64
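The index of that head(1) result is what gets passed to loc; a quick check (using the same df) that it picks out rows 0, 3 and 5:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

# head(1) keeps the first row per group and preserves the original index labels
first_rows = df.groupby('vintage')['val2'].head(1)
print(list(first_rows.index))  # [0, 3, 5]
```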
Contrast this with first, which returns the first non-NaN value unless a group contains only NaN values:
In[95]:
df.groupby('vintage')['val2'].first()
Out[95]:
vintage
2017-01-01 0.66
2017-02-01 0.53
2017-03-01 NaN
Name: val2, dtype: float64
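If the frame is already sorted so that each vintage's rows are contiguous (as here), an equivalent one-liner (a sketch of mine, not part of the answer above) masks the first occurrence of each vintage with duplicated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'vintage': ['2017-01-01', '2017-01-01', '2017-01-01', '2017-02-01', '2017-02-01', '2017-03-01'],
    'val2': [np.nan, 0.66, 0.81, 0.53, 0.62, np.nan],
})

# ~duplicated() is True only on the first occurrence of each vintage value
df.loc[~df['vintage'].duplicated(), 'val2'] = np.nan
print(df['val2'].isna().tolist())  # [True, False, False, True, False, True]
```

Unlike the groupby approaches, this relies on the first occurrence of each vintage value being the first row of its group, so it breaks if the rows are shuffled.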
Upvotes: 26