Reputation: 1845
I need to create a data frame with 10 columns (floating numbers) and I need to make sure that each row has 5 Nan values.
Data Frame Which I want to create
A B C D E F G H I J
1.0 Nan 2.0 Nan Nan Nan Nan 5.0 6.0 7.0
Nan Nan Nan 3.0 5.0 Nan Nan 5.0 6.0 7.0
1.0 2.0 3.0 5.0 8.0 Nan Nan Nan Nan Nan
1.0 Nan 3.0 Nan 8.0 10.0 Nan 12.0 Nan Nan
I would like to create this type of data set where each row has 5 NAN values and 5 valid values. I would like to return the column value which has 3rd occurrence of Nan values for each row in a series.
Expected Output
E (it has 3rd occurrence of Nan value in 1st row)
C (it has 3rd occurrence of Nan value in 2nd row)
H (it has 3rd occurrence of Nan value in 3rd row)
G (it has 3rd occurrence of Nan value in 4th row)
Upvotes: 1
Views: 183
Reputation: 17538
Just for curiosity, I %timeit
the two slightly different approaches by @Wen and @HaleemurAli for time performance perspective:
In [69]: df_cumsum = df.isna().cumsum(1) # The common base
In [70]: %timeit df_cumsum == 3
310 µs ± 7.89 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [71]: %timeit df_cumsum.eq(3) # WIN by slight advantage
123 µs ± 2.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [72]: df_locate = df.isna().cumsum(1).eq(3) # To find the index
In [73]: %timeit df_locate.idxmax(axis=1)
206 µs ± 8.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [74]: %timeit np.argmax(df_locate.values, 1) # WIN by enormous advantage
9.63 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Upvotes: 0
Reputation: 28303
use isnull
to find all rows that are null, cumsum
with axis=1
to incrementally count them, filter where the null-count equals 3 & use idxmax
with axis=1
to get the column name.
(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)
you can create a random dataframe with 5 values and 5 nulls using the following helper function. note, i used randn
, thus the values will be floats from a standard normal distribution, you can replace with another random distribution of your choice
import string
import numpy as np
from numpy.random import permutation, randn
def get_matrix(rows, vals):
return [permutation(np.append(randn(vals), [np.nan]*(vals))) for _ in range(rows)]
df = pd.DataFrame(
get_matrix(4,5), list(string.ascii_uppercase[:2*5])
)
Upvotes: 1
Reputation: 323326
Using cumsum
with argmax
df.columns[np.argmax(df.isnull().cumsum(1).eq(3).values,1)]
Out[788]: Index(['E', 'C', 'H', 'G'], dtype='object')
For dataframe creation
df=pd.DataFrame(np.random.randn(4, 10),columns=list('ABCDEFGHIJ'))
for x in range(len(df)):
df.iloc[x,np.random.choice(10, 5, replace=False)]=np.nan
df
Out[783]:
A B C D E F G H \
0 1.263644 NaN -0.427018 NaN NaN 0.160732 0.033323 -1.285068
1 NaN 2.713568 -0.964603 1.456543 NaN NaN 0.201837 1.034501
2 NaN NaN NaN -0.262311 NaN 0.361472 -0.089562 0.478207
3 NaN 1.497916 -0.324090 NaN NaN NaN 0.711363 -0.094587
I J
0 NaN NaN
1 NaN NaN
2 NaN 0.944062
3 NaN -0.298129
Upvotes: 3