Reputation: 17
I have a dataframe with shape (10x401) having duplicate columns with same column names and values. Some of them have nulls while other have numeric values. The columns names are not in sorted order. A short example of dataframe is given below:
ID#, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,.........,100, 100, 100, 100
1, , , , , 3, 3, 3, 3, , , , ,........., 0, 0, 0, 0
2, 0, 0, 0, 0, , , , , 10, 10, 10, 10,........., , , ,
3, 9, 9, 9, 9, 1, 1, 1, 1, 4, 4, 4, 4,........., 1, 1, 1, 1
.
.
.
10, , , , , , , , , , , , ,........., 6, 6, 6, 6
By ignoring the null values, i need to replace each first occurrence of the numeric value (from 0 to 10) with 1 and the rest of the values with -1 for all 10 rows and 400 columns ignoring the ID column. The resulting dataframe will look like:
ID#, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,.........,100, 100, 100, 100
1, , , , , 1, -1, -1, -1, , , , ,........., 1, -1, -1, -1
2, 1, -1, -1, -1, , , , , 1, -1, -1, -1,........., , , ,
3, 1, -1, -1, -1, 1, -1, -1, -1, 1, -1, -1, -1,........., 1, -1, -1, -1
.
.
.
10, , , , , , , , , , , , ,........., 1, -1, -1, -1
I will be thankful for some help here.
Upvotes: 0
Views: 242
Reputation: 71
First, some example data:
import pandas as pd
from io import StringIO
df_string = '''
ID;1;1;1;1;2;2;2;2;3;3;3;3
1;;;;;3;3;3;3;;;;
2;0;0;0;0;;;;;10;10;10;10
3;9;9;9;9;1;1;1;1;4;4;4;4
4;;;;;;;;;6;6;6;6
'''
df = pd.read_csv(StringIO(df_string), sep = ";", index_col="ID")
# Removing the automatically added .1/.2/... suffixes. You don't need that for your data.
df.columns = df.columns.str[0]
1 1 1 1 2 2 2 2 3 3 3 3
ID
1 NaN NaN NaN NaN 3.0 3.0 3.0 3.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 NaN NaN NaN NaN 10.0 10.0 10.0 10.0
3 9.0 9.0 9.0 9.0 1.0 1.0 1.0 1.0 4.0 4.0 4.0 4.0
4 NaN NaN NaN NaN NaN NaN NaN NaN 6.0 6.0 6.0 6.0
I recommend transposing the DataFrame, as it's more convenient to use the vectorized methods from pandas. Most of them can be used "horizontally" with specifying the axis=1
.
df = df.T
ID 1 2 3 4
1 NaN 0.0 9.0 NaN
1 NaN 0.0 9.0 NaN
1 NaN 0.0 9.0 NaN
1 NaN 0.0 9.0 NaN
2 3.0 NaN 1.0 NaN
2 3.0 NaN 1.0 NaN
2 3.0 NaN 1.0 NaN
2 3.0 NaN 1.0 NaN
3 NaN 10.0 4.0 6.0
3 NaN 10.0 4.0 6.0
3 NaN 10.0 4.0 6.0
3 NaN 10.0 4.0 6.0
First you need to know all the cells where there are values:
ValueMask = ~df.isna()
ID 1 2 3 4
1 False True True False
1 False True True False
1 False True True False
1 False True True False
2 True False True False
2 True False True False
2 True False True False
2 True False True False
3 False True True True
3 False True True True
3 False True True True
3 False True True True
Secondly, you need to know all the starting positions of a new group. Shifting the whole DataFrame down by one row and checking for unequality helps. Combining that with your ValueMask
gives you the starting cells:
StartMask = (df.shift() != df) & ValueMask
ID 1 2 3 4
1 False True True False
1 False False False False
1 False False False False
1 False False False False
2 True False True False
2 False False False False
2 False False False False
2 False False False False
3 False True True True
3 False False False False
3 False False False False
3 False False False False
Now you can set all the cells of value to -1
and afterwards all the cells that are a the start of a group to 1
df[ValueMask] = -1
df[StartMask] = 1
ID 1 2 3 4
1 NaN 1.0 1.0 NaN
1 NaN -1.0 -1.0 NaN
1 NaN -1.0 -1.0 NaN
1 NaN -1.0 -1.0 NaN
2 1.0 NaN 1.0 NaN
2 -1.0 NaN -1.0 NaN
2 -1.0 NaN -1.0 NaN
2 -1.0 NaN -1.0 NaN
3 NaN 1.0 1.0 1.0
3 NaN -1.0 -1.0 -1.0
3 NaN -1.0 -1.0 -1.0
3 NaN -1.0 -1.0 -1.0
Now you can always transpose it back:
df = df.T
1 1 1 1 2 2 2 2 3 3 3 3
ID
1 NaN NaN NaN NaN 1.0 -1.0 -1.0 -1.0 NaN NaN NaN NaN
2 1.0 -1.0 -1.0 -1.0 NaN NaN NaN NaN 1.0 -1.0 -1.0 -1.0
3 1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0 1.0 -1.0 -1.0 -1.0
4 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 -1.0 -1.0 -1.0
Upvotes: 2