Reputation: 9345
I have a pandas DataFrame, df, and I'd like to get the mean for columns 180 through the end (not including the last column), using only the first 100K rows.
If I use the whole DataFrame:
df.mean().isnull().any()
I get False
If I use only the first 100K rows:
train_means = df.iloc[:100000, 180:-1].mean()
train_means.isnull().any()
I get: True
I'm not sure how this is possible, since the second approach only computes the column means for a subset of the full DataFrame. So if no column in the full DataFrame has a mean of NaN, I don't see how a column in a subset of the full DataFrame can.
For what it's worth, I ran:
df.columns[df.isna().all()].tolist()
and I get: []. So I don't think I have any columns where every entry is NaN (which would cause a NaN in my train_means calculation).
Any idea what I'm doing incorrectly?
Thanks!
Upvotes: 1
Views: 142
Reputation: 323226
Try looking at:
(df.iloc[:100000, 180:-1].isnull().sum()==100000).any()
If this returns True, it means at least one column is entirely NaN in the first 100000 rows.
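To see which columns are responsible, the same check can return the column names instead of a single boolean. This is a minimal sketch on a made-up toy frame (the column names and the 3-row slice are assumptions standing in for your df and df.iloc[:100000, 180:-1]):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is NaN in the first 3 rows only.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 4.0, 5.0],
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Stand-in for df.iloc[:100000, 180:-1] on the real data.
subset = df.iloc[:3]

# Names of columns whose values are all NaN within the subset.
all_nan_cols = subset.columns[subset.isna().all()].tolist()
print(all_nan_cols)
```

Note that this is the same check you ran on the whole DataFrame, just applied to the slice, which is where the result can differ.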
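As for why you get no NaN when taking the mean of the whole DataFrame: mean has skipna=True by default, so it drops NaN values before computing the mean. A column's mean is therefore NaN only when every value being averaged is NaN. That can be true for the first 100000 rows even though the same column has non-NaN values further down.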
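Here is a small reproduction of that behavior, assuming a made-up two-column frame where one column is NaN only in its leading rows (the data and the 2-row slice are illustrative, not taken from your DataFrame):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is NaN in the first two rows, valid afterwards.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, 5.0],
})

# Whole frame: skipna=True (the default) ignores the NaNs in "b",
# so its mean is computed from 3.0 and 5.0.
full_means = df.mean()

# First two rows only: every value in "b" is NaN, so nothing is left
# to average and the mean of "b" is NaN.
head_means = df.iloc[:2].mean()

print(full_means.isnull().any())
print(head_means.isnull().any())
```

The first print gives False and the second gives True, which matches what you are seeing on the full data versus the first 100K rows.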
Upvotes: 2