Reputation: 53
I have panel data and would like to drop the first few rows that contain NaN in each group. (Or, more generally, some method that drops rows based on their index within the group plus other conditions.)
df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
     'PRICE': [None, 11.5, 14.31, 15.125, 14.44, None, None, None, None, 23.55],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115, 19920116]},
    index=range(1, 11))
The data would look like:
ID PRICE date
1 10001 NaN 19920103
2 10001 11.500 19920106
3 10001 14.310 19920107
4 10002 15.125 19920108
5 10002 14.440 19920109
6 10002 NaN 19920110
7 10003 NaN 19920113
8 10003 NaN 19920114
9 10003 NaN 19920115
10 10003 23.550 19920116
I would like to drop lines 1 and 7, but not line 9, since line 9 is not one of the first few missing observations in its group. I tried:
def mask_first_missing(x):
    result = x.notnull() & x.rank()==1
    return result

mask = df.groupby(['ID'])['PRICE'].transform(mask_first_missing).astype(bool)
print(df[mask])
But it removed rows 1, 7, and 9, even though row 9 is not the first observation in group 10003.
If I do this:
df[df.groupby('ID', as_index=False)['PRICE'].nth(0).notnull()]
then the index created by the groupby object is not aligned with the original dataframe.
Could anybody help me with this? Thank you.
Upvotes: 2
Views: 511
Reputation: 53
Thank you for your help, but I think neither of the answers fits my task.
I figured out a solution myself by creating a subindex column.
df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
     'PRICE': [None, 11.5, None, 14.31, 15.125, 14.44, None, None, None, None, 23.55],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115, 19920116, 19920122]},
    index=range(1, 12))
df.loc[:, 'subindex'] = df.groupby('ID').cumcount()
Then one will obtain
ID PRICE date subindex
1 10001 NaN 19920103 0
2 10001 11.500 19920106 1
3 10001 NaN 19920107 2
4 10001 14.310 19920108 3
5 10002 15.125 19920109 0
6 10002 14.440 19920110 1
7 10002 NaN 19920113 2
8 10003 NaN 19920114 0
9 10003 NaN 19920115 1
10 10003 NaN 19920116 2
11 10003 23.550 19920122 3
Instead of doing everything with groupby, I can now select the nth observation of each group based on the 'subindex' column.
So if I want to drop the first two NaN observations of 'PRICE' in each group, I can create a mask:
mask_first_few_nan = (df.loc[:, 'PRICE'].isnull()) & (df.loc[:, 'subindex'] <= 1)
df[~mask_first_few_nan]
The result is
ID PRICE date subindex
2 10001 11.500 19920106 1
3 10001 NaN 19920107 2
4 10001 14.310 19920108 3
5 10002 15.125 19920109 0
6 10002 14.440 19920110 1
7 10002 NaN 19920113 2
10 10003 NaN 19920116 2
11 10003 23.550 19920122 3
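The same idea can be wrapped into a small reusable helper (a sketch; `drop_first_n_nan` is my own name, shown here on the question's original 10-row data rather than the 11-row data above):

```python
import pandas as pd

def drop_first_n_nan(df, group_col, value_col, n=1):
    # Position of each row within its group (0-based).
    subindex = df.groupby(group_col).cumcount()
    # Drop rows that are NaN *and* sit among the first n rows of their group.
    mask = df[value_col].isnull() & (subindex <= n - 1)
    return df[~mask]

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
     'PRICE': [None, 11.5, 14.31, 15.125, 14.44, None, None, None, None, 23.55],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115, 19920116]},
    index=range(1, 11))

result = drop_first_n_nan(df, 'ID', 'PRICE', n=1)
# n=1 drops only rows 1 and 7 (a NaN in the first position of its group);
# the later NaNs in rows 6, 8, and 9 are kept.
```

With `n=2` the NaN in row 8 (second position of group 10003) would be dropped as well, while row 6 survives because it is the third row of its group.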
Upvotes: 0
Reputation: 210852
An alternative approach using a custom ranking:
In [49]: df[df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan))
            .groupby('ID').x.cumsum().fillna(np.inf) > 1]
Out[49]:
ID PRICE date
2 10001 11.500 19920106
3 10001 14.310 19920107
4 10002 15.125 19920108
5 10002 14.440 19920109
6 10002 14.120 19920110
8 10003 16.500 19920114
9 10003 NaN 19920115
Explanation:
In [50]: df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan))
Out[50]:
ID PRICE date x
1 10001 NaN 19920103 1.0
2 10001 11.500 19920106 NaN
3 10001 14.310 19920107 NaN
4 10002 15.125 19920108 NaN
5 10002 14.440 19920109 NaN
6 10002 14.120 19920110 NaN
7 10003 NaN 19920113 1.0
8 10003 16.500 19920114 NaN
9 10003 NaN 19920115 1.0
In [51]: df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan)).groupby('ID').x.cumsum().fillna(np.inf)
Out[51]:
1 1.000000
2 inf
3 inf
4 inf
5 inf
6 inf
7 1.000000
8 inf
9 2.000000
Name: x, dtype: float64
In [52]: df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan)).groupby('ID').x.cumsum().fillna(np.inf) > 1
Out[52]:
1 False
2 True
3 True
4 True
5 True
6 True
7 False
8 True
9 True
Name: x, dtype: bool
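For comparison, here is the same cumsum trick as a self-contained script run on the question's original data (note the output in this answer uses the answerer's own sample frame). One caveat: the mask drops the *first* NaN of each group wherever it occurs, so the trailing NaN in row 6 is dropped as well, not only leading NaNs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
     'PRICE': [None, 11.5, 14.31, 15.125, 14.44, None, None, None, None, 23.55],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115, 19920116]},
    index=range(1, 11))

# Mark NaN rows with 1 and valid rows with NaN; cumsum skips the NaNs,
# so each NaN row receives its running NaN count within the group.
x = df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan))
# Valid rows become inf so they always pass the > 1 comparison.
keep = x.groupby('ID').x.cumsum().fillna(np.inf) > 1
result = df[keep]
# Rows 1, 6, and 7 (the first NaN of each group) are dropped;
# rows 8 and 9 survive because their running NaN count exceeds 1.
```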
Upvotes: 0
Reputation: 294328
This is a way to do it:
notnull = df.PRICE.notnull()                        # rows that have a price
protected = df.index > df.PRICE.last_valid_index()  # rows after the last valid price in the column
df[notnull | protected]
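A per-group variant of the same idea can be sketched with `transform`, which broadcasts each group's `last_valid_index` back to its rows (an assumption on my part that the protection is meant per group; on the question's data this keeps all valid prices plus the trailing NaN in row 6):

```python
import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
     'PRICE': [None, 11.5, 14.31, 15.125, 14.44, None, None, None, None, 23.55],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115, 19920116]},
    index=range(1, 11))

notnull = df.PRICE.notnull()
# Index label of the last valid price, broadcast across each group's rows.
last_valid = df.groupby('ID')['PRICE'].transform(lambda s: s.last_valid_index())
# Protect NaNs that come after their group's last valid price (row 6 here).
protected = df.index.to_series() > last_valid
result = df[notnull | protected]
```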
Upvotes: 1