How can I select out columns where the first values are NaN?

Question

Here is a sample of some data I am working with.

                 A        B        C
2014-01-01  4072.0   9871.0      NaN
2014-02-01     NaN   9948.0      NaN
2014-03-01     NaN  10248.0      NaN
2014-04-01     NaN   9872.0      NaN
2014-05-01     NaN  12438.0      NaN
2014-06-01     NaN  10588.0      NaN
2014-07-01     NaN   8718.0      NaN
2014-08-01     NaN  10378.0      NaN
2014-09-01     NaN   9563.0      NaN
2014-10-01     NaN  10669.0      NaN
2014-11-01     NaN   9843.0      NaN
2014-12-01     NaN   9837.0      NaN
2015-01-01     NaN   8606.0      NaN
2015-02-01     NaN  10458.0      NaN
2015-03-01     NaN   9351.0      NaN
2015-04-01     NaN   8705.0      NaN
2015-05-01     NaN  11887.0      NaN
2015-06-01     NaN   8979.0      NaN
2015-07-01     NaN   8373.0      NaN
2015-08-01     NaN  10206.0      NaN
2015-09-01     NaN   9672.0      NaN
2015-10-01     NaN  10351.0      NaN
2015-11-01     NaN   8482.0    808.0
2015-12-01     NaN   7987.0   7691.0
2016-01-01     NaN   7881.0   8327.0
2016-02-01     NaN   7418.0   8220.0
2016-03-01     NaN   6324.0   9086.0
2016-04-01     NaN   3617.0   8362.0
2016-05-01     NaN     39.0  13298.0
2016-06-01     NaN      0.0  13408.0
2016-07-01     NaN      NaN  16140.0
2016-08-01     NaN      NaN  14520.0
2016-09-01     NaN      NaN  14800.0
2016-10-01     NaN      NaN  15407.0
2016-11-01     NaN      NaN  15812.0
2016-12-01     NaN      NaN   2017.0

Some of the columns (like A and B) have non-NaN values in the first couple of rows. Other columns (like C) have non-NaN values only in the last rows.

I'm interested in removing columns like C. How can I slice these out?

jezrael · Accepted Answer

You can build a boolean mask from the first row of data: select that row with iloc and call notnull on it. Then pass the mask to loc, because the boolean indexing has to be applied to the columns rather than the rows:

print (df.iloc[0].notnull())
A     True
B     True
C    False
Name: 2014-01-01, dtype: bool

print (df.loc[:, df.iloc[0].notnull()])
                A        B
2014-01-01  4072.0   9871.0
2014-02-01     NaN   9948.0
2014-03-01     NaN  10248.0
2014-04-01     NaN   9872.0
2014-05-01     NaN  12438.0
2014-06-01     NaN  10588.0
2014-07-01     NaN   8718.0
2014-08-01     NaN  10378.0
2014-09-01     NaN   9563.0
2014-10-01     NaN  10669.0
2014-11-01     NaN   9843.0
2014-12-01     NaN   9837.0
2015-01-01     NaN   8606.0

Another solution:

print (df[df.columns[df.iloc[0].notnull()]])
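
For anyone who wants to try both forms without the original data, here is a minimal self-contained sketch. The small frame is only a stand-in built from a few of the question's rows, and the names mask, kept_loc and kept_cols are mine, not part of the original answer:

```python
import numpy as np
import pandas as pd

# Stand-in for the question's data: four of its rows, same three columns.
idx = pd.to_datetime(["2014-01-01", "2014-02-01", "2015-11-01", "2016-07-01"])
df = pd.DataFrame(
    {
        "A": [4072.0, np.nan, np.nan, np.nan],
        "B": [9871.0, 9948.0, 8482.0, np.nan],
        "C": [np.nan, np.nan, 808.0, 16140.0],
    },
    index=idx,
)

# Boolean mask indexed by column name: True where the first row holds a value.
mask = df.iloc[0].notnull()

# Form 1: boolean indexing on the column axis with loc.
kept_loc = df.loc[:, mask]

# Form 2: filter the column labels first, then select by label.
kept_cols = df[df.columns[mask]]

print(kept_loc)
```

Both forms should give the same frame; column C is dropped because its value in the first row is NaN.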

Timings:

#maxu solution
In [216]: %timeit (df[df.columns[df.notnull().iloc[0]]])
1000 loops, best of 3: 995 µs per loop

In [217]: %timeit (df.loc[:, df.iloc[0].notnull()])
1000 loops, best of 3: 635 µs per loop

In [218]: %timeit (df[df.columns[df.iloc[0].notnull()]])
1000 loops, best of 3: 820 µs per loop

#[360000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
print (df)

#maxu solution
In [233]: %timeit (df[df.columns[df.notnull().iloc[0]]])
100 loops, best of 3: 7.07 ms per loop

In [234]: %timeit (df.loc[:, df.iloc[0].notnull()])
100 loops, best of 3: 4.14 ms per loop

In [235]: %timeit (df[df.columns[df.iloc[0].notnull()]])
100 loops, best of 3: 4.3 ms per loop
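
The timings above come from IPython's %timeit. If you want to repeat the comparison in a plain script, something like the following sketch with the standard timeit module should work. The frame here is an assumption: random values with the same shape as the enlarged frame, not the original data, so the absolute numbers will differ from those above:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame: 360000 rows x 3 columns of random values.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((360000, 3)), columns=list("ABC"))
df.iloc[0, 2] = np.nan  # make column C start with NaN, as in the question

# The three expressions compared in the timings above.
candidates = {
    "loc + mask": lambda: df.loc[:, df.iloc[0].notnull()],
    "columns[mask]": lambda: df[df.columns[df.iloc[0].notnull()]],
    "notnull on whole frame": lambda: df[df.columns[df.notnull().iloc[0]]],
}

results = {}
for name, fn in candidates.items():
    # Best of 3 repeats, 10 calls per repeat, converted to seconds per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    results[name] = best
    print(f"{name}: {best * 1e3:.2f} ms per call")
```

Results depend on the pandas version and the machine, so treat any ranking as indicative only.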
