Ruslan
Ruslan

Reputation: 951

Python: Replace dataframe values row by row

I'd like to conditionally replace values, row-by-row in a pandas dataframe so that max(row) will remain, while all other values in the row will be set to None. My intuition goes towards apply() but I am not sure if that's the right choice, or how to do it.

Example (but there may be multiple columns):

tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
} )

tmp
    A   B
0   1   3
1   2   4
2   3   1
3   4   33
4   5   10
5   6   9
6   7   7
7   8   3
8   9   10
9   10  10

Wanted output:

somemagic(tmp)
    A       B
0   None    3
1   None    4
2   3       None
3   None    33
4   None    10
5   None    9
6   7       None    # on tie I don't really care which one is set to None
7   8       None
8   None    10
9   10      None    # on tie I don't really care which one is set to None

Any suggestions on how to achieve that?

Upvotes: 1

Views: 386

Answers (2)

jezrael
jezrael

Reputation: 862601

You can compared DataFrame values by eq with max:

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])

mask = (tmp.eq(tmp.max(axis=1), axis=0))
print (mask)
       A      B
0  False   True
1  False   True
2   True  False
3  False   True
4  False   True
5  False   True
6   True   True
7   True  False
8  False   True
9   True   True

df = (tmp[mask])
print (df)
      A     B
0   NaN   3.0
1   NaN   4.0
2   3.0   NaN
3   NaN  33.0
4   NaN  10.0
5   NaN   9.0
6   7.0   7.0
7   8.0   NaN
8   NaN  10.0
9  10.0  10.0

and then you can add NaN if values in columns are equal:

mask = (tmp.eq(tmp.max(axis=1), axis=0))
mask['B'] = mask.B & (tmp.A != tmp.B)
print (mask)
       A      B
0  False   True
1  False   True
2   True  False
3  False   True
4  False   True
5  False   True
6   True  False
7   True  False
8  False   True
9   True  False

df = (tmp[mask])
print (df)
      A     B
0   NaN   3.0
1   NaN   4.0
2   3.0   NaN
3   NaN  33.0
4   NaN  10.0
5   NaN   9.0
6   7.0   NaN
7   8.0   NaN
8   NaN  10.0
9  10.0   NaN

Timings (len(df)=10):

In [234]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
1000 loops, best of 3: 974 µs per loop

In [235]: %timeit (gh(tmp))
The slowest run took 4.32 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.64 ms per loop

(len(df)=100k):

In [244]: %timeit (tmp[tmp.eq(tmp.max(axis=1), axis=0)])
100 loops, best of 3: 7.42 ms per loop

In [245]: %timeit (gh(t1))
1 loop, best of 3: 8.81 s per loop

Code for timings:

import pandas as pd

tmp= pd.DataFrame({
'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
} )


tmp = pd.concat([tmp]*10000).reset_index(drop=True)
t1 = tmp.copy()

print (tmp[tmp.eq(tmp.max(axis=1), axis=0)])


def top(row):
    data = row.tolist()
    return [d if d == max(data) else None for d in data]

def gh(tmp1):
    return tmp1.apply(top, axis=1)

print (gh(t1))

Upvotes: 2

Gurupad Hegde
Gurupad Hegde

Reputation: 2155

I would suggest you to use apply(). You can use it as below:

In [1]: import pandas as pd

In [2]: tmp= pd.DataFrame({
   ...: 'A': pd.Series([1,2,3,4,5,6,7,8,9,10], index=range(0,10)),
   ...: 'B': pd.Series([3,4,1,33,10,9,7,3,10,10], index=range(0,10))
   ...: } )

In [3]: tmp
Out[3]: 
    A   B
0   1   3
1   2   4
2   3   1
3   4  33
4   5  10
5   6   9
6   7   7
7   8   3
8   9  10
9  10  10

In [4]: def top(row):
   ...:         data = row.tolist()
   ...:         return [d if d == max(data) else None for d in data]
   ...: 

In [5]: df2 = tmp.apply(top, axis=1)

In [6]: df2
Out[6]: 
    A   B
0 NaN   3
1 NaN   4
2   3 NaN
3 NaN  33
4 NaN  10
5 NaN   9
6   7   7
7   8 NaN
8 NaN  10
9  10  10

Upvotes: 2

Related Questions