Reputation: 89
I have 100K rows of data with 4 columns. One of the columns contains lots of NaN values. I want to remove each row with a NaN and the 10 rows below it.
Upvotes: 1
Views: 57
Reputation: 294258
Borrowing jezrael's setup
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan
Use pd.concat and any:
# grab null rows
isnull = df.A.isnull()
# flag each null row and the 10 rows after it
# (fill_value=False keeps the shifted edges boolean)
mask = pd.concat([isnull.shift(i, fill_value=False) for i in range(11)], axis=1).any(axis=1)
# keep everything that is not flagged
df[~mask]
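Putting the whole shift-based mask together as a self-contained script (same setup as above) might look like this:

```python
import numpy as np
import pandas as pd

# sample frame: column A has two runs of NaN
df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan

isnull = df.A.isnull()
# a row is dropped if it is NaN, or if any of the 10 preceding rows was NaN
mask = pd.concat([isnull.shift(i, fill_value=False) for i in range(11)], axis=1).any(axis=1)
result = df[~mask]
```

With NaN at rows 2–5 and 20–23, this keeps rows 0, 1 and 16–19 (everything else is a NaN row or within 10 rows below one).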
Upvotes: 1
Reputation: 862611
You can use boolean indexing with a mask that counts the rows after NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan
print (df)
A B C D
0 0.0 1 2 3
1 4.0 5 6 7
2 NaN 9 10 11
3 NaN 13 14 15
4 NaN 17 18 19
5 NaN 21 22 23
6 24.0 25 26 27
7 28.0 29 30 31
8 32.0 33 34 35
9 36.0 37 38 39
10 40.0 41 42 43
11 44.0 45 46 47
12 48.0 49 50 51
13 52.0 53 54 55
14 56.0 57 58 59
15 60.0 61 62 63
16 64.0 65 66 67
17 68.0 69 70 71
18 72.0 73 74 75
19 76.0 77 78 79
20 NaN 81 82 83
21 NaN 85 86 87
22 NaN 89 90 91
23 NaN 93 94 95
24 96.0 97 98 99
25 100.0 101 102 103
26 104.0 105 106 107
27 108.0 109 110 111
28 112.0 113 114 115
29 116.0 117 118 119
a = df.A.notnull()
b = (a.cumsum()-a.cumsum().where(~a).ffill().fillna(0)).where(df.A.isnull().cumsum() != 0)
print (b)
0 NaN
1 NaN
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 2.0
8 3.0
9 4.0
10 5.0
11 6.0
12 7.0
13 8.0
14 9.0
15 10.0
16 11.0
17 12.0
18 13.0
19 14.0
20 0.0
21 0.0
22 0.0
23 0.0
24 1.0
25 2.0
26 3.0
27 4.0
28 5.0
29 6.0
Name: A, dtype: float64
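The one-liner that builds `b` can be unpacked into named steps; this sketch of the same logic (with the setup repeated so it runs on its own) shows what each piece does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan

a = df.A.notnull()
running = a.cumsum()                              # running count of non-null rows
at_last_nan = running.where(~a).ffill().fillna(0) # count frozen at each NaN row, carried forward
since_nan = running - at_last_nan                 # non-null rows seen since the last NaN run
b = since_nan.where(df.A.isnull().cumsum() != 0)  # undefined (NaN) before the first NaN appears
```

So `b` is 0 on NaN rows, counts 1, 2, 3, … on the rows after a NaN run, and is NaN for rows before the first NaN.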
# remove NaN rows and the 5 rows below them (the question's case would use b > 10)
print (df[b > 5])
A B C D
11 44.0 45.0 46.0 47.0
12 48.0 49.0 50.0 51.0
13 52.0 53.0 54.0 55.0
14 56.0 57.0 58.0 59.0
15 60.0 61.0 62.0 63.0
16 64.0 65.0 66.0 67.0
17 68.0 69.0 70.0 71.0
18 72.0 73.0 74.0 75.0
19 76.0 77.0 78.0 79.0
29 116.0 117.0 118.0 119.0
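For the question's actual requirement (10 rows below each NaN), the same counter is filtered with `b > 10`. Note that, as in the output above, rows before the first NaN are also dropped, because `b` is NaN there. A self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan

a = df.A.notnull()
# non-null rows seen since the last NaN run; NaN before the first NaN appears
b = (a.cumsum() - a.cumsum().where(~a).ffill().fillna(0)).where(df.A.isnull().cumsum() != 0)
# keep only rows that are more than 10 non-null rows past the last NaN run
result = df[b > 10]
```

With NaN at rows 2–5 and 20–23, only rows 16–19 survive: the second NaN run arrives before the block after the first run reaches 11 clean rows is over, and only 6 clean rows follow the second run.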
Upvotes: 1