Reputation: 745
I have a large-ish structure in as a pandas dataframe, shape = (2000, 200000) I want to replace all of the values of 2 in each column with that particular columns mean (excluding the 2 values). This is how I do it for small structures, but for larger ones it takes a significantly longer time. Y
is the DataFrame.
for i in Y.columns:
x = Y[i][Y[i] != 2].mean()
Y[i][Y[i] == 2] = x
Is there a faster solution?
Upvotes: 2
Views: 849
Reputation: 294348
Assuming your dataframe is df
, Let:
a = df.values
I'd calculate what the average is across non-twos with
avg = (a.sum(0) - (a == 2).sum(0) * 2.) / (a != 2).sum(0)
(a == 2).sum(0)
is the number of 2
s in each column.
Then use np.where
to select between a
and avg
. Wrap it in a pd.DataFrame
constructor.
pd.DataFrame(np.where(a != 2, a, avg), df.index, df.columns)
Upvotes: 1
Reputation: 221584
Since you are working with arithmetic operations, I would suggest offloading all those computations to NumPy, get the final result and create a dataframe, like so -
# Extract into an array
arr = Y.values
# Mask to set or not-set elements in array
mask = arr!=2
# Compute the mean vaalues for masked elements along each column
avg = np.true_divide((arr*mask).sum(0),mask.sum(0))
# Finally choose values based on mask and create output dataframe
Yout = pd.DataFrame(np.where(~mask,avg,arr),columns=Y.columns)
For further performance boost, you can probably replace (arr*mask).sum(0)
with np.einsum('ij,ij->j',arr,mask)
as np.einsum
is very efficient in most cases.
Verify results with a sample run -
In [128]: Y = pd.DataFrame(np.random.randint(0,4,(10,5)))
In [129]: arr = Y.values
...: mask = arr!=2
...: avg = np.true_divide((arr*mask).sum(0),mask.sum(0))
...: Yout = pd.DataFrame(np.where(~mask,avg,arr),columns=Y.columns)
...:
In [130]: for i in Y.columns:
...: x = Y[i][Y[i] != 2].mean()
...: Y[i][Y[i] == 2] = x
...:
In [131]: np.allclose(Y,Yout)
Out[131]: True
Upvotes: 3