Faster way to replace specific values per column in Python

Question

I have a large-ish structure stored as a pandas DataFrame with shape (2000, 200000). I want to replace every value of 2 in each column with that column's mean (computed excluding the 2s). This is how I do it for small structures, but for larger ones it takes significantly longer. Y is the DataFrame.

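for i in Y.columns:
    mask = Y[i] == 2                # rows holding the value 2
    x = Y.loc[~mask, i].mean()      # column mean excluding the 2s
    Y.loc[mask, i] = x              # .loc avoids chained assignment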

Is there a faster solution?

piRSquared · Accepted Answer

Assuming your DataFrame is df, let:

a = df.values

I'd calculate the per-column average of the non-2 values with

avg = (a.sum(0) - (a == 2).sum(0) * 2.) / (a != 2).sum(0)

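(a == 2).sum(0) is the number of 2s in each column, so multiplying it by 2. gives their total contribution to the column sum. (a != 2).sum(0) is the number of remaining values in each column.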
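To make the arithmetic concrete, here is a minimal sketch on a made-up 3x2 array (the data and the names n_twos and n_rest are illustrative, not from the question):

```python
import numpy as np

# Hypothetical 3x2 array; each column contains exactly one 2.
a = np.array([[1., 2.],
              [2., 4.],
              [3., 6.]])

n_twos = (a == 2).sum(0)   # number of 2s per column
n_rest = (a != 2).sum(0)   # number of remaining values per column

# Column sum minus the contribution of the 2s, divided by the remaining count.
avg = (a.sum(0) - n_twos * 2.) / n_rest

print(avg)  # [2. 5.]
```

The first column is 1, 2, 3: the sum is 6, removing the single 2 leaves 4, and dividing by the two remaining values gives 2. The second column works out to (12 - 2) / 2 = 5 in the same way.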

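Then use np.where to select between a and avg: where a != 2, keep the original value; otherwise take the column average. avg has one entry per column, so it broadcasts across the rows of a. Wrap the result in a pd.DataFrame constructor to restore the index and column labels.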

pd.DataFrame(np.where(a != 2, a, avg), df.index, df.columns)
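Putting the pieces together, a rough end-to-end sketch might look like this. The function name replace_twos_with_column_mean and the sample data are made up for illustration, and the loop at the end is only there to check the result against the column-by-column approach from the question:

```python
import numpy as np
import pandas as pd

def replace_twos_with_column_mean(df):
    """Return a new DataFrame with each 2 replaced by its column's mean of non-2 values."""
    a = df.values
    avg = (a.sum(0) - (a == 2).sum(0) * 2.) / (a != 2).sum(0)
    # avg has one entry per column, so it broadcasts across the rows of a.
    return pd.DataFrame(np.where(a != 2, a, avg), df.index, df.columns)

# Hypothetical sample data.
df = pd.DataFrame({"p": [1., 2., 3.], "q": [2., 4., 6.]})
result = replace_twos_with_column_mean(df)

# Reference result: the column-by-column loop.
expected = df.copy()
for col in expected.columns:
    mask = expected[col] == 2
    expected.loc[mask, col] = expected.loc[~mask, col].mean()

print(np.allclose(result.values, expected.values))  # True
```

Note that a column consisting only of 2s has no non-2 values, so the division is 0/0 and that column comes out as NaN (with a NumPy runtime warning).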
