add-semi-colons

Reputation: 18830

Pandas: better way to get rows that have all columns null except one

I have a data frame with 100 columns, where the first column is a unique identifier, the second column is the group the user belongs to, and the rest of the columns describe each user's characteristics.

user_id, group_id, a, b, c,....,az
0001, 1, 23, pro, 5.5, ......, 234
0002, 1, 32, che, 3.3, ......, 342
0003, 2, NaN, NaN, NaN,......., NaN

I want to subset all the records that have NaN(s) in all the columns except for column 1 and column 2 (user_id, group_id).

One way to do this is: df[df['a'].notnull() & (df['b'].notnull()) & (df['c'].notnull())], but writing that out for 98 columns seems inefficient. Is there a better way?
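
For reproducibility, here is a minimal frame matching the sample above (column names a, b, c and az are taken from the example; the elided "..." columns are omitted, so this is only a sketch of the real data):

import numpy as np
import pandas as pd

# Small reproduction of the sample data; the elided columns are left out
df = pd.DataFrame({
    'user_id':  [1, 2, 3],
    'group_id': [1, 1, 2],
    'a':  [23, 32, np.nan],
    'b':  ['pro', 'che', np.nan],
    'c':  [5.5, 3.3, np.nan],
    'az': [234, 342, np.nan],
})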

Upvotes: 1

Views: 3369

Answers (3)

BENY

Reputation: 323326

I would use dropna + thresh; it basically adds a tolerance for the number of NaNs allowed per row (thresh is the minimum number of non-NaN values required to keep a row).

df.dropna(axis=0,thresh=df.shape[1]-2)
Out[101]: 
   user_id   group_id    a     b     c    az
0        1          1   23   pro   5.5   234
1        2          1   32   che   3.3   342
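
For clarity, thresh=df.shape[1]-2 requires at least that many non-NaN values per row, i.e. rows with more than two NaNs are dropped. Spelled out as an explicit boolean mask (an equivalent sketch, not part of the original answer):

df[df.isnull().sum(axis=1) <= 2]   # keep rows with at most 2 NaNs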

Upvotes: 3

MaxU - stand with Ukraine

Reputation: 210882

In [197]: df[df.iloc[:, 2:].notna().all(1)]
Out[197]:
   user_id  group_id     a    b    c     az
0        1         1  23.0  pro  5.5  234.0
1        2         1  32.0  che  3.3  342.0

if user_id and group_id are always set and can never be NaN's:

In [205]: df[df.notna().sum(1).gt(2)]
Out[205]:
   user_id  group_id     a    b    c     az
0        1         1  23.0  pro  5.5  234.0
1        2         1  32.0  che  3.3  342.0

PS: older Pandas versions don't have the DataFrame.notna() method - in this case use the DataFrame.notnull() method.

UPDATE:

to select rows where all columns except the first two have NaN values:

In [215]: df[df.iloc[:, 2:].isnull().all(1)]
Out[215]:
   user_id  group_id   a    b   c  az
2        3         2 NaN  NaN NaN NaN
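
If both subsets are needed (fully populated rows and all-NaN rows), the mask over the non-id columns can be computed once and reused; a small sketch building on the code above (variable names are my own):

vals = df.iloc[:, 2:]                      # everything except user_id and group_id
complete = df[vals.notna().all(axis=1)]    # rows with no NaNs in the value columns
all_nan = df[vals.isna().all(axis=1)]      # rows where every value column is NaN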

Upvotes: 6

akuiper

Reputation: 215047

You might drop the user_id and group_id columns first, check for nulls, then subset:

df[df.drop(['user_id', 'group_id'], axis=1).notnull().all(axis=1)]

#   user_id  group_id     a    b    c     az
#0        1         1  23.0  pro  5.5  234.0
#1        2         1  32.0  che  3.3  342.0

Or if the two columns' positions are fixed at 0 and 1:

df[df.iloc[:,2:].notnull().all(axis=1)]
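
If the two id columns are not guaranteed to sit in positions 0 and 1, the same filter can also be expressed by name (a sketch using Index.difference, which returns the remaining column labels):

value_cols = df.columns.difference(['user_id', 'group_id'])
df[df[value_cols].notnull().all(axis=1)]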

Upvotes: 4
