How to replace duplicate dataframe column values with certain conditions in python

Question

I have a dataframe with shape (10x401) having duplicate columns with same column names and values. Some of them have nulls while other have numeric values. The columns names are not in sorted order. A short example of dataframe is given below:

       ID#,       1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,.........,100,  100, 100, 100
   
        1,         ,   ,   ,   ,  3,  3,  3,  3,   ,   ,   ,   ,.........,  0,    0,   0,   0   
        2,        0,  0,  0,  0,   ,   ,   ,   , 10, 10, 10, 10,.........,   ,     ,    ,   
        3,        9,  9,  9,  9,  1,  1,  1,  1,  4,  4,  4,  4,.........,  1,    1,   1,   1
        .
        .
        .
       10,         ,   ,   ,   ,   ,   ,    ,  ,   ,   ,    ,   ,........., 6,    6,   6,   6

By ignoring the null values, i need to replace each first occurrence of the numeric value (from 0 to 10) with 1 and the rest of the values with -1 for all 10 rows and 400 columns ignoring the ID column. The resulting dataframe will look like:

       ID#,       1,  1,  1,  1,  2,  2,  2,  2,  3,  3,  3,  3,.........,100,  100, 100, 100
   
        1,         ,   ,   ,   ,  1, -1, -1, -1,   ,   ,   ,   ,.........,  1,   -1,   -1, -1   
        2,        1, -1, -1, -1,   ,   ,   ,   ,  1, -1, -1, -1,.........,   ,     ,     ,   
        3,        1, -1, -1, -1,  1, -1, -1, -1,  1, -1, -1, -1,.........,  1,   -1,   -1, -1
        .
        .
        .
       10,         ,   ,   ,   ,   ,   ,    ,  ,   ,   ,    ,   ,........., 1,   -1,   -1, -1

I will be thankful for some help here.

flom · Accepted Answer

First, some example data:

import pandas as pd

from io import StringIO

df_string = '''
ID;1;1;1;1;2;2;2;2;3;3;3;3
1;;;;;3;3;3;3;;;;
2;0;0;0;0;;;;;10;10;10;10
3;9;9;9;9;1;1;1;1;4;4;4;4
4;;;;;;;;;6;6;6;6
'''

df = pd.read_csv(StringIO(df_string), sep = ";", index_col="ID")

# Removing the automatically added .1/.2/... suffixes. You don't need that for your data.
df.columns = df.columns.str[0]

      1    1    1    1    2    2    2    2     3     3     3     3
ID                                                                
1   NaN  NaN  NaN  NaN  3.0  3.0  3.0  3.0   NaN   NaN   NaN   NaN
2   0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN  10.0  10.0  10.0  10.0
3   9.0  9.0  9.0  9.0  1.0  1.0  1.0  1.0   4.0   4.0   4.0   4.0
4   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   6.0   6.0   6.0   6.0

I recommend transposing the DataFrame, as it's more convenient to use the vectorized methods from pandas. Most of them can be used "horizontally" with specifying the axis=1.

df = df.T

ID    1     2    3    4
1   NaN   0.0  9.0  NaN
1   NaN   0.0  9.0  NaN
1   NaN   0.0  9.0  NaN
1   NaN   0.0  9.0  NaN
2   3.0   NaN  1.0  NaN
2   3.0   NaN  1.0  NaN
2   3.0   NaN  1.0  NaN
2   3.0   NaN  1.0  NaN
3   NaN  10.0  4.0  6.0
3   NaN  10.0  4.0  6.0
3   NaN  10.0  4.0  6.0
3   NaN  10.0  4.0  6.0

First you need to know all the cells where there are values:

ValueMask = ~df.isna()

ID      1      2     3      4
1   False   True  True  False
1   False   True  True  False
1   False   True  True  False
1   False   True  True  False
2    True  False  True  False
2    True  False  True  False
2    True  False  True  False
2    True  False  True  False
3   False   True  True   True
3   False   True  True   True
3   False   True  True   True
3   False   True  True   True

Secondly, you need to know all the starting positions of a new group. Shifting the whole DataFrame down by one row and checking for unequality helps. Combining that with your ValueMask gives you the starting cells:

StartMask = (df.shift() != df) & ValueMask

ID      1      2      3      4
1   False   True   True  False
1   False  False  False  False
1   False  False  False  False
1   False  False  False  False
2    True  False   True  False
2   False  False  False  False
2   False  False  False  False
2   False  False  False  False
3   False   True   True   True
3   False  False  False  False
3   False  False  False  False
3   False  False  False  False

Now you can set all the cells of value to -1 and afterwards all the cells that are a the start of a group to 1

df[ValueMask] = -1
df[StartMask] = 1

ID    1    2    3    4
1   NaN  1.0  1.0  NaN
1   NaN -1.0 -1.0  NaN
1   NaN -1.0 -1.0  NaN
1   NaN -1.0 -1.0  NaN
2   1.0  NaN  1.0  NaN
2  -1.0  NaN -1.0  NaN
2  -1.0  NaN -1.0  NaN
2  -1.0  NaN -1.0  NaN
3   NaN  1.0  1.0  1.0
3   NaN -1.0 -1.0 -1.0
3   NaN -1.0 -1.0 -1.0
3   NaN -1.0 -1.0 -1.0

Now you can always transpose it back:

df = df.T

      1    1    1    1    2    2    2    2    3    3    3    3
ID                                                            
1   NaN  NaN  NaN  NaN  1.0 -1.0 -1.0 -1.0  NaN  NaN  NaN  NaN
2   1.0 -1.0 -1.0 -1.0  NaN  NaN  NaN  NaN  1.0 -1.0 -1.0 -1.0
3   1.0 -1.0 -1.0 -1.0  1.0 -1.0 -1.0 -1.0  1.0 -1.0 -1.0 -1.0
4   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0 -1.0 -1.0 -1.0

How to replace duplicate dataframe column values with certain conditions in python

Answers (1)

Related Questions