Reputation: 9869
Suppose I have a dataframe as shown below:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'A':np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
df
>>>
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 0.0
And I have the list of columns which I want to populate with the value of 1, when A
is negative.
idx = df.A < 0
cols = ['B', 'C']
So in this case, I want the indices [1, 'B'] and [4, 'C'] set to 1.
However, doing df.loc[idx, cols] = 1
sets the entire row to be 1, and not just the individual column. I also tried doing df.loc[idx, cols] = pd.get_dummies(cols)
which gave the result:
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 1.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 NaN NaN
I'm assuming this is because the index of get_dummies
and the dataframe don't line up.
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
So what's the best (read fastest) way to do this. In my case, there are 1000's of rows and 5 columns.
TLDR: editing values directly is faster.
%%timeit
df.values[idx, df.columns.get_indexer(cols)] = 1
123 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.iloc[idx.array,df.columns.get_indexer(cols)]=1
266 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 2
Views: 98
Reputation: 862831
Use numpy indexing for improve performance:
idx = df.A < 0
res = ['B', 'C']
arr = df.values
arr[idx, df.columns.get_indexer(res)] = 1
print (arr)
[[ 0.49671415 0. 0. ]
[-0.1382643 1. 0. ]
[ 0.64768854 0. 0. ]
[ 1.52302986 0. 0. ]
[-0.23415337 0. 1. ]]
df = pd.DataFrame(arr, columns=df.columns, index=df.index)
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
Alternative:
idx = df.A < 0
res = ['B', 'C']
df.values[idx, df.columns.get_indexer(res)] = 1
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
Upvotes: 3
Reputation: 423
ind = df.index[idx]
for idx,col in zip(ind,res):
...: df.at[idx,col] = 1
In [7]: df
Out[7]:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
Upvotes: 1