Reputation: 127
I am writting script in Python and I am looking for optimal solution of following problem:
I have big pandas dataframe (at least 100k rows) and if there are rows with the same value in col2 but different value in col3 then I want to change all values in col3 for A
For example:
----------------------
| col1 | col2 | col3 |
----------------------
| a | 1 | A |
----------------------
| b | 2 | A |
----------------------
| c | 2 | B |
----------------------
| d | 2 | B |
----------------------
| e | 3 | B |
----------------------
| f | 3 | B |
----------------------
should look like this:
----------------------
| col1 | col2 | col3 |
----------------------
| a | 1 | A |
----------------------
| b | 2 | A |
----------------------
| c | 2 | A |
----------------------
| d | 2 | A |
----------------------
| e | 3 | B |
----------------------
| f | 3 | B |
----------------------
I solved that problem by sorting dataframe over col2 and iterating over rows, whenever value in col2 changes and in "block" of the same col2 values are different values I change col3 value but this algorithm takes around 60s for 100k rows and I am looking for more sufficient answer.
Upvotes: 1
Views: 99
Reputation: 862691
Use GroupBy.transform
with DataFrameGroupBy.nunique
for test number of unique values and set new values by condition in DataFrame.loc
:
df.loc[df.groupby('col2')['col3'].transform('nunique') != 1, 'col3'] = 'A'
print (df)
col1 col2 col3
0 a 1 A
1 b 2 A
2 c 2 A
3 d 2 A
4 e 3 B
5 f 3 B
Details:
First check number of unique values per groups with transform
for same size Series like original DataFrame:
print (df.groupby('col2')['col3'].transform('nunique'))
0 1
1 2
2 2
3 2
4 1
5 1
Name: col3, dtype: int64
And thentest for not equal:
print (df.groupby('col2')['col3'].transform('nunique') != 1)
0 False
1 True
2 True
3 True
4 False
5 False
Name: col3, dtype: bool
Last overwrite True
rows by value A
.
Upvotes: 2