Reputation: 18830
I have a data frame with 100 columns, where the first column is a unique identifier, the second column is the group the user belongs to, and the rest of the columns describe each user's characteristics.
user_id, group_id, a, b, c,....,az
0001, 1, 23, pro, 5.5, ......, 234
0002, 1, 32, che, 3.3, ......, 342
0003, 2, NaN, NaN, NaN,......., NaN
I want to subset the records that have no NaN(s) in any of the columns except for column 1 and column 2 (user_id, group_id).
One way to do this is:
df[df['a'].notnull() & (df['b'].notnull()) & (df['c'].notnull())]
but writing this out for 98 columns seems inefficient. Is there a better way?
Upvotes: 1
Views: 3369
Reputation: 323326
I would use dropna with thresh, which keeps only rows with at least that many non-NaN values.
df.dropna(axis=0,thresh=df.shape[1]-2)
Out[101]:
user_id group_id a b c az
0 1 1 23 pro 5.5 234
1 2 1 32 che 3.3 342
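As a self-contained sketch (the frame below mirrors the question's sample data, with the middle columns abbreviated to a, b, c, az):

```python
import pandas as pd
import numpy as np

# Sample frame mirroring the question: the first two columns are
# identifiers, the rest are characteristics that may contain NaNs.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "group_id": [1, 1, 2],
    "a": [23.0, 32.0, np.nan],
    "b": ["pro", "che", np.nan],
    "c": [5.5, 3.3, np.nan],
    "az": [234.0, 342.0, np.nan],
})

# Keep rows with at least (total columns - 2) non-NaN values, i.e. rows
# where every non-identifier column is populated (assuming user_id and
# group_id themselves are never NaN).
clean = df.dropna(axis=0, thresh=df.shape[1] - 2)
print(clean)
```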
Upvotes: 3
Reputation: 210882
In [197]: df[df.iloc[:, 2:].notna().all(1)]
Out[197]:
user_id group_id a b c az
0 1 1 23.0 pro 5.5 234.0
1 2 1 32.0 che 3.3 342.0
If user_id and group_id are always set and can never be NaN:
In [205]: df[df.notna().sum(1).gt(2)]
Out[205]:
user_id group_id a b c az
0 1 1 23.0 pro 5.5 234.0
1 2 1 32.0 che 3.3 342.0
PS older Pandas versions don't have the DataFrame.notna() method - in this case use the DataFrame.notnull() method instead.
UPDATE: to select rows where all columns except the first two have all NaN values:
In [215]: df[df.iloc[:, 2:].isnull().all(1)]
Out[215]:
user_id group_id a b c az
2 3 2 NaN NaN NaN NaN
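Both selections can be checked on a small frame (same assumed sample data as in the question):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "group_id": [1, 1, 2],
    "a": [23.0, 32.0, np.nan],
    "b": ["pro", "che", np.nan],
    "c": [5.5, 3.3, np.nan],
    "az": [234.0, 342.0, np.nan],
})

# Rows where every column after the first two is populated.
complete = df[df.iloc[:, 2:].notna().all(1)]

# Rows where every column after the first two is NaN.
empty = df[df.iloc[:, 2:].isnull().all(1)]

print(complete["user_id"].tolist())  # [1, 2]
print(empty["user_id"].tolist())     # [3]
```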
Upvotes: 6
Reputation: 215047
You might drop the user_id and group_id columns first, test for nulls, then subset:
df[df.drop(['user_id', 'group_id'], axis=1).notnull().all(axis=1)]
# user_id group_id a b c az
#0 1 1 23.0 pro 5.5 234.0
#1 2 1 32.0 che 3.3 342.0
Or if the two columns' positions are fixed at 0 and 1:
df[df.iloc[:,2:].notnull().all(axis=1)]
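A quick check that the name-based and position-based selections agree (sample frame assumed from the question):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "group_id": [1, 1, 2],
    "a": [23.0, 32.0, np.nan],
    "b": ["pro", "che", np.nan],
    "c": [5.5, 3.3, np.nan],
    "az": [234.0, 342.0, np.nan],
})

# Select rows by dropping the id columns by name, then requiring all
# remaining columns to be non-null.
by_name = df[df.drop(['user_id', 'group_id'], axis=1).notnull().all(axis=1)]

# Equivalent positional selection when the id columns sit at 0 and 1.
by_position = df[df.iloc[:, 2:].notnull().all(axis=1)]

print(by_name.equals(by_position))  # True
```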
Upvotes: 4