lstyls

Reputation: 173

Memory explosion with boolean indexing in Pandas

I am working with a very large series of floats in pandas 0.12.0. I am trying to set extreme outliers to NaN in this series, which represents a standardized feature vector (mean 0, std 1).

I have no trouble making a boolean mask of the feature vector to find extreme outliers:

mask = feature_series > 10 | feature_series < -10

This takes minimal resources. However, when I attempt to actually use this mask I get a memory explosion and have to force exit before a crash occurs. This happens with:

feature_series[mask] = np.nan

It's not limited to this operation either. I also get a memory explosion with:

mask.any()

What's making this happen? I feel like it may be a bug, but I'm still relatively new to Pandas and can't be sure.

Upvotes: 1

Views: 214

Answers (1)

behzad.nouri

Reputation: 77971

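You probably need some parentheses. In Python, | binds more tightly than the comparison operators > and <, so the original expression is evaluated as a chained comparison around (10 | feature_series) instead of as the OR of two comparisons. Wrap each comparison in parentheses: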

mask = (feature_series > 10) | (feature_series < -10)
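
As a rough illustration, here is a minimal sketch of the parenthesized mask applied to a small series. The sample values are made up, and the lower bound is assumed to be -10 (symmetric around the standardized mean of 0), which the question does not state explicitly. It has not been checked against the old pandas version mentioned in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical standardized feature vector with two planted outliers.
feature_series = pd.Series([0.1, -0.4, 12.5, 0.9, -11.2, 0.0])

# Parenthesize each comparison, because | binds tighter than > and <.
# Assumption: the intended lower bound is -10.
mask = (feature_series > 10) | (feature_series < -10)

# For a symmetric threshold, the absolute value gives the same mask
# without needing | at all.
abs_mask = feature_series.abs() > 10

# Work on a copy so the original series is left untouched.
cleaned = feature_series.copy()
cleaned[mask] = np.nan
```

With a symmetric threshold, the abs() form sidesteps the precedence issue entirely, which may be the simpler choice here.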

Upvotes: 2
