sachinruk
sachinruk

Reputation: 9869

pandas assign result from list of columns

Suppose I have a dataframe as shown below:

import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'A':np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
df
>>>
          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  0.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  0.0

And I have the list of columns which I want to populate with the value of 1, when A is negative.

idx = df.A < 0
cols = ['B', 'C']

So in this case, I want the indices [1, 'B'] and [4, 'C'] set to 1.

What I tried:

However, doing df.loc[idx, cols] = 1 sets the entire row to be 1, and not just the individual column. I also tried doing df.loc[idx, cols] = pd.get_dummies(cols) which gave the result:

          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  0.0  1.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  NaN  NaN

I'm assuming this is because the index of get_dummies and the dataframe don't line up.

Expected Output:

          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  1.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  1.0

So what's the best (read fastest) way to do this. In my case, there are 1000's of rows and 5 columns.

Timing of results:

TLDR: editing values directly is faster.

%%timeit
df.values[idx, df.columns.get_indexer(cols)] = 1

123 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
df.iloc[idx.array,df.columns.get_indexer(cols)]=1

266 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Upvotes: 2

Views: 98

Answers (2)

jezrael
jezrael

Reputation: 862831

Use numpy indexing for improve performance:

idx = df.A < 0
res = ['B', 'C']
arr = df.values
arr[idx, df.columns.get_indexer(res)] = 1
print (arr)
[[ 0.49671415  0.          0.        ]
 [-0.1382643   1.          0.        ]
 [ 0.64768854  0.          0.        ]
 [ 1.52302986  0.          0.        ]
 [-0.23415337  0.          1.        ]]

df = pd.DataFrame(arr, columns=df.columns, index=df.index)
print (df)
          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  1.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  1.0

Alternative:

idx = df.A < 0
res = ['B', 'C']
df.values[idx, df.columns.get_indexer(res)] = 1
print (df)
          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  1.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  1.0

Upvotes: 3

kleerofski
kleerofski

Reputation: 423

ind = df.index[idx]
for idx,col in zip(ind,res):
   ...:     df.at[idx,col] = 1

In [7]: df
Out[7]:
          A    B    C
0  0.496714  0.0  0.0
1 -0.138264  1.0  0.0
2  0.647689  0.0  0.0
3  1.523030  0.0  0.0
4 -0.234153  0.0  1.0

Upvotes: 1

Related Questions