atomsmasher
atomsmasher

Reputation: 745

faster way to replacing specific values per column, python

I have a large-ish structure in as a pandas dataframe, shape = (2000, 200000) I want to replace all of the values of 2 in each column with that particular columns mean (excluding the 2 values). This is how I do it for small structures, but for larger ones it takes a significantly longer time. Y is the DataFrame.

for i in Y.columns:
    x = Y[i][Y[i] != 2].mean()
    Y[i][Y[i] == 2] = x

Is there a faster solution?

Upvotes: 2

Views: 849

Answers (2)

piRSquared
piRSquared

Reputation: 294348

Assuming your dataframe is df, Let:

a = df.values

I'd calculate what the average is across non-twos with

avg = (a.sum(0) - (a == 2).sum(0) * 2.) / (a != 2).sum(0)

(a == 2).sum(0) is the number of 2s in each column.

Then use np.where to select between a and avg. Wrap it in a pd.DataFrame constructor.

pd.DataFrame(np.where(a != 2, a, avg), df.index, df.columns)

Upvotes: 1

Divakar
Divakar

Reputation: 221584

Since you are working with arithmetic operations, I would suggest offloading all those computations to NumPy, get the final result and create a dataframe, like so -

# Extract into an array
arr = Y.values

# Mask to set or not-set elements in array
mask = arr!=2

# Compute the mean vaalues for masked elements along each column
avg = np.true_divide((arr*mask).sum(0),mask.sum(0))

# Finally choose values based on mask and create output dataframe
Yout = pd.DataFrame(np.where(~mask,avg,arr),columns=Y.columns)

For further performance boost, you can probably replace (arr*mask).sum(0) with np.einsum('ij,ij->j',arr,mask) as np.einsum is very efficient in most cases.

Verify results with a sample run -

In [128]: Y = pd.DataFrame(np.random.randint(0,4,(10,5)))

In [129]: arr = Y.values
     ...: mask = arr!=2
     ...: avg = np.true_divide((arr*mask).sum(0),mask.sum(0))
     ...: Yout = pd.DataFrame(np.where(~mask,avg,arr),columns=Y.columns)
     ...: 

In [130]: for i in Y.columns:
     ...:     x = Y[i][Y[i] != 2].mean()
     ...:     Y[i][Y[i] == 2] = x
     ...:     

In [131]: np.allclose(Y,Yout)
Out[131]: True

Upvotes: 3

Related Questions