How to change value in dataframe cell if there are two different values with the same key

Question

I am writting script in Python and I am looking for optimal solution of following problem:

I have big pandas dataframe (at least 100k rows) and if there are rows with the same value in col2 but different value in col3 then I want to change all values in col3 for A

For example:

----------------------
| col1 | col2 | col3 |
----------------------
|   a  |   1  |   A  |
----------------------
|   b  |   2  |   A  |
----------------------
|   c  |   2  |   B  |
----------------------
|   d  |   2  |   B  |
----------------------
|   e  |   3  |   B  |
----------------------
|   f  |   3  |   B  |
----------------------

should look like this:

----------------------
| col1 | col2 | col3 |
----------------------
|   a  |   1  |   A  |
----------------------
|   b  |   2  |   A  |
----------------------
|   c  |   2  |   A  |
----------------------
|   d  |   2  |   A  |
----------------------
|   e  |   3  |   B  |
----------------------
|   f  |   3  |   B  |
----------------------

I solved that problem by sorting dataframe over col2 and iterating over rows, whenever value in col2 changes and in "block" of the same col2 values are different values I change col3 value but this algorithm takes around 60s for 100k rows and I am looking for more sufficient answer.

jezrael · Accepted Answer

Use GroupBy.transform with DataFrameGroupBy.nunique for test number of unique values and set new values by condition in DataFrame.loc:

df.loc[df.groupby('col2')['col3'].transform('nunique') != 1, 'col3'] = 'A' 
print (df)
  col1  col2 col3
0    a     1    A
1    b     2    A
2    c     2    A
3    d     2    A
4    e     3    B
5    f     3    B

Details:

First check number of unique values per groups with transform for same size Series like original DataFrame:

print (df.groupby('col2')['col3'].transform('nunique'))
0    1
1    2
2    2
3    2
4    1
5    1
Name: col3, dtype: int64

And thentest for not equal:

print (df.groupby('col2')['col3'].transform('nunique') != 1)
0    False
1     True
2     True
3     True
4    False
5    False
Name: col3, dtype: bool

Last overwrite True rows by value A.

How to change value in dataframe cell if there are two different values with the same key

Answers (1)

Related Questions