Kevin Thompson
Kevin Thompson

Reputation: 2506

pandas DataFrame efficiently remove columns

I have a DataFrame that is 1600 x 115,316 and mostly NaN. I want to remove every column that doesn't have more than two values in it. I have written some code that works but man alive it takes a long time.

for eachColumn in df.columns:
  if np.sum(df[eachColumn]) <= 2:
    df.drop(eachColumn, axis=1, inplace=True)

Is there a better way to accomplish what I'm trying to do here?

Upvotes: 1

Views: 685

Answers (1)

TomAugspurger
TomAugspurger

Reputation: 28956

Here's some sample data:

In [18]: df = pd.DataFrame(np.ones(shape=(10, 2))) * np.nan

In [19]: df.iloc[[2, 3], 0] = 2

So column 0 has 2 non-NaNs. Keep that and drop 1.

In [20]: df.drop(df.columns[(~df.isnull()).sum() < 2], axis=1)
Out[20]: 
    0
0 NaN
1 NaN
2   2
3   2
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN

Read it inside out. .isnull gives a DataFrame of Bools, flip the Trues/Falses with the ~, sum that to get the count of non-null per column, select out the columns where the sum is less than 2 (column 1) and drop that./

Upvotes: 3

Related Questions