Avi
Avi

Reputation: 1845

How do I get the columns based on 3rd occurrence of a value in a given row?

I need to create a data frame with 10 columns (floating numbers) and I need to make sure that each row has 5 Nan values.

Data Frame Which I want to create 

A    B    C     D     E     F     G     H    I    J   
1.0  Nan  2.0   Nan   Nan   Nan   Nan   5.0  6.0  7.0
Nan  Nan  Nan   3.0   5.0   Nan   Nan   5.0  6.0  7.0
1.0   2.0  3.0   5.0   8.0   Nan   Nan   Nan  Nan  Nan
1.0   Nan  3.0   Nan  8.0   10.0  Nan   12.0  Nan  Nan

I would like to create this type of data set where each row has 5 NAN values and 5 valid values. I would like to return the column value which has 3rd occurrence of Nan values for each row in a series.

  Expected Output 
  E (it has 3rd occurrence of Nan value in 1st row) 
  C (it has 3rd occurrence of Nan value in 2nd row)
  H (it has 3rd occurrence of Nan value in 3rd row)
  G (it has 3rd occurrence of Nan value in 4th row)

Upvotes: 1

Views: 183

Answers (3)

YaOzI
YaOzI

Reputation: 17538

Just for curiosity, I %timeit the two slightly different approaches by @Wen and @HaleemurAli for time performance perspective:

In [69]: df_cumsum = df.isna().cumsum(1) # The common base

In [70]: %timeit df_cumsum == 3
310 µs ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [71]: %timeit df_cumsum.eq(3) # WIN by slight advantage
123 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [72]: df_locate = df.isna().cumsum(1).eq(3) # To find the index

In [73]: %timeit df_locate.idxmax(axis=1)
206 µs ± 8.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)    

In [74]: %timeit np.argmax(df_locate.values, 1) # WIN by enormous advantage
9.63 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Upvotes: 0

Haleemur Ali
Haleemur Ali

Reputation: 28303

use isnull to find all rows that are null, cumsum with axis=1 to incrementally count them, filter where the null-count equals 3 & use idxmax with axis=1 to get the column name.

(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)

you can create a random dataframe with 5 values and 5 nulls using the following helper function. note, i used randn, thus the values will be floats from a standard normal distribution, you can replace with another random distribution of your choice

import string
import numpy as np
from numpy.random import permutation, randn
def get_matrix(rows, vals):
    return [permutation(np.append(randn(vals), [np.nan]*(vals))) for _ in range(rows)]

df = pd.DataFrame(
    get_matrix(4,5), list(string.ascii_uppercase[:2*5])
)

Upvotes: 1

BENY
BENY

Reputation: 323326

Using cumsum with argmax

df.columns[np.argmax(df.isnull().cumsum(1).eq(3).values,1)]
Out[788]: Index(['E', 'C', 'H', 'G'], dtype='object')

For dataframe creation

df=pd.DataFrame(np.random.randn(4, 10),columns=list('ABCDEFGHIJ'))
for x in range(len(df)):
    df.iloc[x,np.random.choice(10, 5, replace=False)]=np.nan
df
Out[783]: 
          A         B         C         D   E         F         G         H  \
0  1.263644       NaN -0.427018       NaN NaN  0.160732  0.033323 -1.285068   
1       NaN  2.713568 -0.964603  1.456543 NaN       NaN  0.201837  1.034501   
2       NaN       NaN       NaN -0.262311 NaN  0.361472 -0.089562  0.478207   
3       NaN  1.497916 -0.324090       NaN NaN       NaN  0.711363 -0.094587   
    I         J  
0 NaN       NaN  
1 NaN       NaN  
2 NaN  0.944062  
3 NaN -0.298129  

Upvotes: 3

Related Questions