astudentofmaths
astudentofmaths

Reputation: 1163

How to get the first index of a pandas dataframe for which two columns are both not null?

I have a dataframe with several columns. I want to get the first row index for which the values oftwo columns are both not NaN. I know I need to use df.first_valid_index()

example: if my dataframe is:

          Date          A             B     
0   2015-01-02          NaN           1     
1   2015-01-02          NaN           2     
2   2015-01-02          NaN           3     
3   2015-01-02          1            NaN     
5   2015-01-02          NaN           2    
7   2015-01-03          1             1   

I would get 7

Upvotes: 1

Views: 351

Answers (2)

Divakar
Divakar

Reputation: 221614

One approach -

(~(pd.isnull(df.A) | pd.isnull(df.B))).idxmax()

Similar one -

(~pd.isnull(df[['A','B']]).any(1)).idxmax()

For performance, we might want to use argmax -

(~pd.isnull(df[['A','B']]).any(1)).argmax()

Thinking purely on performance, we could bring in more of NumPy -

df.index[(~(np.isnan(df.A.values) | np.isnan(df.B.values))).argmax()]

Sample run -

In [172]: df
Out[172]: 
         Date    A    B
0  2015-01-02  NaN  1.0
1  2015-01-02  NaN  2.0
2  2015-01-02  NaN  3.0
3  2015-01-02  1.0  NaN
5  2015-01-02  NaN  2.0
7  2015-01-03  1.0  1.0

In [173]: (~(pd.isnull(df.A) | pd.isnull(df.B))).idxmax()
Out[173]: 7

In [180]: (~pd.isnull(df[['A','B']]).any(1)).idxmax()
Out[180]: 7

In [182]: (~pd.isnull(df[['A','B']]).any(1)).argmax()
Out[182]: 7

In [258]: df.index[(~(np.isnan(df.A.values) | np.isnan(df.B.values))).argmax()]
Out[258]: 7

Runtime test -

In [259]: a = np.random.rand(100000,2)

In [260]: a[np.random.rand(*a.shape)>0.2] = np.nan

In [261]: df = pd.DataFrame(a, columns=list(('A','B')))

# @jezrael's soln
In [262]: %timeit df[['A','B']].notnull().all(axis=1).idxmax()
100 loops, best of 3: 4.91 ms per loop

In [263]: %timeit (~(pd.isnull(df.A) | pd.isnull(df.B))).idxmax()
     ...: %timeit (~pd.isnull(df[['A','B']]).any(1)).idxmax()
     ...: %timeit (~pd.isnull(df[['A','B']]).any(1)).argmax()
     ...: 
1000 loops, best of 3: 1.37 ms per loop
100 loops, best of 3: 4.73 ms per loop
100 loops, best of 3: 4.74 ms per loop

In [264]: %timeit df.index[(~(np.isnan(df.A.values) | np.isnan(df.B.values))).argmax()]
10000 loops, best of 3: 169 µs per loop

Upvotes: 1

jezrael
jezrael

Reputation: 863166

Use notnull with all for check all True values per row and idxmax:

print (df[['A','B']].notnull())
       A      B
0  False   True
1  False   True
2  False   True
3   True  False
5  False   True
7   True   True

print (df[['A','B']].notnull().all(axis=1))
0    False
1    False
2    False
3    False
5    False
7     True
dtype: bool


val = df[['A','B']].notnull().all(axis=1).idxmax()
print (val)
7

Upvotes: 1

Related Questions