Reputation: 89
I have 100K rows of data with 4 columns. One of the columns contains lots of NaN values. I want to remove each row with a NaN and the 10 rows below it.
Upvotes: 1
Views: 57
Reputation: 294258
Borrowing jezrael's setup
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan
Use pd.concat and any:
# grab null rows
isnull = df.A.isnull()
# flag each null row and the 10 rows after it
# (fill_value=False keeps the shifted edges boolean)
mask = pd.concat([isnull.shift(i, fill_value=False) for i in range(11)], axis=1).any(axis=1)
# keep everything that is not flagged
df[~mask]
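Putting the whole shift-based mask together as a self-contained script (same setup as above) might look like this:

```python
import numpy as np
import pandas as pd

# sample frame: column A has two runs of NaN
df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan

isnull = df.A.isnull()
# a row is dropped if it is NaN, or if any of the 10 preceding rows was NaN
mask = pd.concat([isnull.shift(i, fill_value=False) for i in range(11)], axis=1).any(axis=1)
result = df[~mask]
```

With NaN at rows 2–5 and 20–23, this keeps rows 0, 1 and 16–19 (everything else is a NaN row or within 10 rows below one).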
Upvotes: 1
Reputation: 862611
You can use boolean indexing with a mask that counts the rows after NaN values:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan
print (df)
A B C D
0 0.0 1 2 3
1 4.0 5 6 7
2 NaN 9 10 11
3 NaN 13 14 15
4 NaN 17 18 19
5 NaN 21 22 23
6 24.0 25 26 27
7 28.0 29 30 31
8 32.0 33 34 35
9 36.0 37 38 39
10 40.0 41 42 43
11 44.0 45 46 47
12 48.0 49 50 51
13 52.0 53 54 55
14 56.0 57 58 59
15 60.0 61 62 63
16 64.0 65 66 67
17 68.0 69 70 71
18 72.0 73 74 75
19 76.0 77 78 79
20 NaN 81 82 83
21 NaN 85 86 87
22 NaN 89 90 91
23 NaN 93 94 95
24 96.0 97 98 99
25 100.0 101 102 103
26 104.0 105 106 107
27 108.0 109 110 111
28 112.0 113 114 115
29 116.0 117 118 119
a = df.A.notnull()
b = (a.cumsum()-a.cumsum().where(~a).ffill().fillna(0)).where(df.A.isnull().cumsum() != 0)
print (b)
0 NaN
1 NaN
2 0.0
3 0.0
4 0.0
5 0.0
6 1.0
7 2.0
8 3.0
9 4.0
10 5.0
11 6.0
12 7.0
13 8.0
14 9.0
15 10.0
16 11.0
17 12.0
18 13.0
19 14.0
20 0.0
21 0.0
22 0.0
23 0.0
24 1.0
25 2.0
26 3.0
27 4.0
28 5.0
29 6.0
Name: A, dtype: float64
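The one-liner that builds `b` can be unpacked into named steps; this sketch of the same logic (with the setup repeated so it runs on its own) shows what each piece does:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan

a = df.A.notnull()
running = a.cumsum()                              # running count of non-null rows
at_last_nan = running.where(~a).ffill().fillna(0) # count frozen at each NaN row, carried forward
since_nan = running - at_last_nan                 # non-null rows seen since the last NaN run
b = since_nan.where(df.A.isnull().cumsum() != 0)  # undefined (NaN) before the first NaN appears
```

So `b` is 0 on NaN rows, counts 1, 2, 3, … on the rows after a NaN run, and is NaN for rows before the first NaN.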
# remove NaN rows and the 5 rows below them (the question's case would use b > 10)
print (df[b > 5])
A B C D
11 44.0 45.0 46.0 47.0
12 48.0 49.0 50.0 51.0
13 52.0 53.0 54.0 55.0
14 56.0 57.0 58.0 59.0
15 60.0 61.0 62.0 63.0
16 64.0 65.0 66.0 67.0
17 68.0 69.0 70.0 71.0
18 72.0 73.0 74.0 75.0
19 76.0 77.0 78.0 79.0
29 116.0 117.0 118.0 119.0
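For the question's actual requirement (10 rows below each NaN), the same counter is filtered with `b > 10`. Note that, as in the output above, rows before the first NaN are also dropped, because `b` is NaN there. A self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(120).reshape(30, 4), columns=list('ABCD'))
df.loc[2:5, 'A'] = np.nan
df.loc[20:23, 'A'] = np.nan

a = df.A.notnull()
# non-null rows seen since the last NaN run; NaN before the first NaN appears
b = (a.cumsum() - a.cumsum().where(~a).ffill().fillna(0)).where(df.A.isnull().cumsum() != 0)
# keep only rows that are more than 10 non-null rows past the last NaN run
result = df[b > 10]
```

With NaN at rows 2–5 and 20–23, only rows 16–19 survive: the second NaN run arrives before the block after the first run reaches 11 clean rows is over, and only 6 clean rows follow the second run.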
Upvotes: 1