GrumpyJun

Reputation: 53

Pandas: drop the first few rows that contain NaN in each group

I have panel data, and I would like to drop the first (few) row(s) that contain NaN in each group. (Or, more generally, some method that can drop rows based on their index within the group and other conditions.)

import pandas as pd

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
     'PRICE': [None, 11.5, 14.31, 15.125, 14.44, None, None, None, None, 23.55],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115, 19920116]},
    index=range(1, 11))

The data would look like:

    ID      PRICE   date
1   10001   NaN     19920103
2   10001   11.500  19920106
3   10001   14.310  19920107
4   10002   15.125  19920108
5   10002   14.440  19920109
6   10002   NaN     19920110
7   10003   NaN     19920113
8   10003   NaN     19920114
9   10003   NaN     19920115
10  10003   23.550  19920116

I would like to drop lines 1 and 7, but not line 9, since line 9 is not one of the first few missing observations. I tried:

def mask_first_missing(x):
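    # note: & binds more tightly than ==, so this next line evaluates as
    # (x.notnull() & x.rank()) == 1, not x.notnull() & (x.rank() == 1)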
    result = x.notnull() & x.rank()==1
    return result

mask = df.groupby(['ID'])['PRICE'].transform(mask_first_missing).astype(bool)
print(df[mask])

But it removed rows 1, 7, and 9, even though row 9 is not the first observation in group 10003.

If I do this

df[df.groupby('ID', as_index=False)['PRICE'].nth(0).notnull()]

Then the index returned by the groupby object is not aligned with the original DataFrame's index.

Could anybody help me with this? Thank you

Upvotes: 2

Views: 511

Answers (3)

GrumpyJun

Reputation: 53

Thank you for your help, but I think neither of the answers fits my task.

I figured out a solution myself, by creating a subindex column.

df = pd.DataFrame(
    {'ID': [10001, 10001, 10001, 10001, 10002, 10002, 10002, 10003, 10003, 10003, 10003],
     'PRICE': [None, 11.5, None, 14.31, 15.125, 14.44, None, None, None, None, 23.55],
     'date': [19920103, 19920106, 19920107, 19920108, 19920109, 19920110,
              19920113, 19920114, 19920115, 19920116, 19920122]},
    index=range(1, 12))

df.loc[:, 'subindex'] = df.groupby('ID').cumcount()

Then one will obtain

    ID      PRICE   date    subindex
1   10001   NaN     19920103    0
2   10001   11.500  19920106    1
3   10001   NaN     19920107    2
4   10001   14.310  19920108    3
5   10002   15.125  19920109    0
6   10002   14.440  19920110    1
7   10002   NaN     19920113    2
8   10003   NaN     19920114    0
9   10003   NaN     19920115    1
10  10003   NaN     19920116    2
11  10003   23.550  19920122    3

Instead of doing everything through groupby, I can now select the nth observation of each group based on the 'subindex' column.

Now, if I want to drop the NaN observations of 'PRICE' that fall within the first two rows of each group, I can create a mask:

mask_first_few_nan = (df.loc[:, 'PRICE'].isnull()) & (df.loc[:, 'subindex'] <= 1)
df[~mask_first_few_nan]

The result is

    ID      PRICE   date    subindex
2   10001   11.500  19920106    1
3   10001   NaN     19920107    2
4   10001   14.310  19920108    3
5   10002   15.125  19920109    0
6   10002   14.440  19920110    1
7   10002   NaN     19920113    2
10  10003   NaN     19920116    2
11  10003   23.550  19920122    3

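For reuse, the same subindex trick can be wrapped in a small helper. The sketch below is illustrative only; drop_first_nans and its parameter names are mine, not part of the original answer:

import pandas as pd

def drop_first_nans(df, group_col, value_col, n_first):
    # 0-based position of each row within its group
    subindex = df.groupby(group_col).cumcount()
    # NaNs that fall among the first n_first rows of their group
    mask = df[value_col].isnull() & (subindex < n_first)
    return df[~mask]

Calling drop_first_nans(df, 'ID', 'PRICE', 2) reproduces the result above.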
Upvotes: 0

MaxU - stand with Ukraine

Reputation: 210852

Alternative approach using custom ranking:

In [49]: df[df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan))
    ...:      .groupby('ID').x.cumsum().fillna(np.inf) > 1]
Out[49]:
      ID   PRICE      date
2  10001  11.500  19920106
3  10001  14.310  19920107
4  10002  15.125  19920108
5  10002  14.440  19920109
6  10002  14.120  19920110
8  10003  16.500  19920114
9  10003     NaN  19920115

Explanation:

In [50]: df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan))
Out[50]:
      ID   PRICE      date    x
1  10001     NaN  19920103  1.0
2  10001  11.500  19920106  NaN
3  10001  14.310  19920107  NaN
4  10002  15.125  19920108  NaN
5  10002  14.440  19920109  NaN
6  10002  14.120  19920110  NaN
7  10003     NaN  19920113  1.0
8  10003  16.500  19920114  NaN
9  10003     NaN  19920115  1.0

In [51]: df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan)).groupby('ID').x.cumsum().fillna(np.inf)
Out[51]:
1    1.000000
2         inf
3         inf
4         inf
5         inf
6         inf
7    1.000000
8         inf
9    2.000000
Name: x, dtype: float64

In [52]: df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan)).groupby('ID').x.cumsum().fillna(np.inf) > 1
Out[52]:
1    False
2     True
3     True
4     True
5     True
6     True
7    False
8     True
9     True
Name: x, dtype: bool

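If more than one leading NaN per group should go, the same construction generalizes; a sketch assuming one wants to drop the first n NaN observations in each group (n = 2 here is illustrative, not from the answer):

import numpy as np
import pandas as pd

n = 2  # drop the first n NaN PRICE observations in each group
nan_count = (df.assign(x=np.where(pd.isnull(df.PRICE), 1, np.nan))
               .groupby('ID').x.cumsum()  # running count of NaNs within each group
               .fillna(np.inf))           # non-NaN rows always pass the filter
df[nan_count > n]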
Upvotes: 0

piRSquared

Reputation: 294328

This is a way to do it:

notnull = df.PRICE.notnull()
protected = df.index > df.PRICE.last_valid_index()

df[notnull | protected]
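In words: keep every row with a non-null PRICE, plus any rows that come after the frame's last valid PRICE, so trailing NaNs survive. Assuming the question's original 10-row DataFrame, this evaluates to:

       ID   PRICE      date
2   10001  11.500  19920106
3   10001  14.310  19920107
4   10002  15.125  19920108
5   10002  14.440  19920109
10  10003  23.550  19920116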


Upvotes: 1
