Reputation: 85
Hi, I am new to pandas and struggling with a manipulation. I have a dataframe df with a huge number of columns, and I only want to keep the columns that have a count of more than 5000 values.
I tried the loop below but it does not work. Is there an easy way to do this? Also, is there a function I could create to apply this to any dataframe where I want to keep only the columns with n values or more?
for column in df.columns:
    if df[column].count() > 5000:
        column = column
    else:
        df[column].drop()
Thanks
Upvotes: 3
Views: 2392
Reputation: 42946
We can use DataFrame.dropna, which has the argument thresh (the minimum number of non-NA values a row or column needs in order to be kept). For example:
import pandas as pd
import numpy as np
# example dataframe
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan]
})
   A    B    C   D
0  1  4.0  NaN NaN
1  2  5.0  NaN NaN
2  3  NaN  6.0 NaN
We set the threshold to 2; in your case it would be 5000 (or 5001 if the count has to be strictly above 5000, since thresh means "at least this many"):
df.dropna(thresh=2, axis=1)
   A    B
0  1  4.0
1  2  5.0
2  3  NaN
Notice that columns C and D were dropped because they had fewer than 2 non-NA values.
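To cover the second part of the question (a reusable function for any dataframe), the dropna call can be wrapped as in the sketch below. The function name keep_columns_with_min_count is made up for illustration and is not part of pandas; the sketch assumes "n values or more" means at least n non-NA values, which is what thresh checks.

```python
import numpy as np
import pandas as pd


def keep_columns_with_min_count(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return a new dataframe with only the columns that have at least n non-NA values.

    Hypothetical helper for illustration; it simply wraps DataFrame.dropna.
    """
    return df.dropna(axis=1, thresh=n)


# same example dataframe as above
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

result = keep_columns_with_min_count(df, 2)
print(list(result.columns))  # ['A', 'B']
```

For a count strictly above 5000, as in the question, you would pass n=5001. The original df is left unchanged because dropna returns a new dataframe by default.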
Upvotes: 4
Reputation: 10624
Try this:
newdf = df.copy()
for column in newdf.columns:
    if df[column].count() <= 5000:
        df = df.drop(column, axis=1)
or the equivalent:
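newdf = df.copy()
for column in newdf.columns:
    if df[column].count() <= 5000:
        # del df.column would look for an attribute literally named "column";
        # index with the loop variable instead
        del df[column]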
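The loop can probably also be avoided altogether with a boolean mask over the per-column counts. This is only a sketch on a small made-up dataframe, with a hypothetical threshold n = 1 standing in for 5000; DataFrame.count returns the number of non-NA values per column.

```python
import numpy as np
import pandas as pd

# small made-up dataframe, used only for illustration
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, 6],
    'D': [np.nan, np.nan, np.nan],
})

n = 1  # hypothetical threshold; the question uses 5000

# non-NA count per column: A=3, B=2, C=1, D=0
counts = df.count()

# keep only the columns whose count is strictly above n
filtered = df.loc[:, counts > n]
print(list(filtered.columns))  # ['A', 'B']
```

Unlike the loops above, this does not modify df in place; assign the result back to df if that is what you want.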
Upvotes: 0