themaster
themaster

Reputation: 473

Checking for multiple blank fields in a row in pandas

I have to check for blank on multiple column. How can I do that effectively. I am checking for blank on each row of a single column like below -

for item, frame in df['Column_1'].notnull().iteritems():
    if(frame != False):
        print item , frame

I need to have a dataframe where rows on column_1 , column_2 and column_3 are not blank.

Upvotes: 1

Views: 2141

Answers (1)

jezrael
jezrael

Reputation: 862751

I believe need to filter columns by subset, then check no missing values and get all Trues rows by DataFrame.all, last filter by boolean indexing:

cols = ['Column_1','Column_2','Column_3']
df1 = df[df[cols].notnull().all(axis=1)]

Better is use dropna with parameter subset:

df1 = df.dropna(subset=cols)

Sample:

df = pd.DataFrame({'A':list('abcdef'),
                   'Column_1':[np.nan,5,4,5,5,4],
                   'Column_2':[np.nan,8,np.nan,4,2,3],
                   'Column_3':[np.nan,3,5,np.nan,1,0],
                   'E':[5,3,6,9,2,np.nan],
                   'F':list('aaabbb')})

print (df)
   A  Column_1  Column_2  Column_3    E  F
0  a       NaN       NaN       NaN  5.0  a
1  b       5.0       8.0       3.0  3.0  a
2  c       4.0       NaN       5.0  6.0  a
3  d       5.0       4.0       NaN  9.0  b
4  e       5.0       2.0       1.0  2.0  b
5  f       4.0       3.0       0.0  NaN  b

cols = ['Column_1','Column_2','Column_3']
df1 = df.dropna(subset=cols)
print (df1)
   A  Column_1  Column_2  Column_3    E  F
1  b       5.0       8.0       3.0  3.0  a
4  e       5.0       2.0       1.0  2.0  b
5  f       4.0       3.0       0.0  NaN  b

Performance:

In smaller DataFrame is faster boolean indexing solution, but it also depends of number of missing values.

graph

Code:

np.random.seed(123)

import perfplot

cols = ['Column_1','Column_2','Column_3']

def boolean_indexing(df):
    df1 = df[df[cols].notnull().all(axis=1)]
    return df1

def dropna_func(df):
    df1 = df.dropna(subset=cols)
    return df1

def make_df(n):
    df = pd.DataFrame({'Column_1':np.random.choice([1,2,3,np.nan], size=n),
                       'Column_2':np.random.choice([1,2,3,4,5,6,7,8,np.nan], size=n),
                       'Column_3':np.random.choice([1,2,3,4,8,0,10,11,5,6,np.nan], size=n),
                       'E':np.random.choice([1,2,3,np.nan], size=n)})
    return df

perfplot.show(
    setup=make_df,
    kernels=[boolean_indexing, dropna_func],
    n_range=[2**k for k in range(2, 20)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')

Upvotes: 1

Related Questions